Message ID: alpine.DEB.2.21.1909041252230.94813@chino.kir.corp.google.com
Series: revert immediate fallback to remote hugepages
On Wed, Sep 4, 2019 at 12:54 PM David Rientjes <rientjes@google.com> wrote:
>
> This series reverts those reverts and attempts to propose a more sane
> default allocation strategy specifically for hugepages. Andrea
> acknowledges this is likely to fix the swap storms that he originally
> reported that resulted in the patches that removed __GFP_THISNODE from
> hugepage allocations.

There's no way we can try this for 5.3 even if it looks ok. This is
"let's try this during the 5.4 merge window" material, and see how it
works.

But I'd love affected people to test this all on their loads and post
numbers, so that we have actual numbers for this series when we do try
to merge it.

Linus
On Wed, Sep 04, 2019 at 12:54:15PM -0700, David Rientjes wrote:
> Two commits:
>
> commit a8282608c88e08b1782141026eab61204c1e533f
> Author: Andrea Arcangeli <aarcange@redhat.com>
> Date:   Tue Aug 13 15:37:53 2019 -0700
>
>     Revert "mm, thp: restore node-local hugepage allocations"
>
> commit 92717d429b38e4f9f934eed7e605cc42858f1839
> Author: Andrea Arcangeli <aarcange@redhat.com>
> Date:   Tue Aug 13 15:37:50 2019 -0700
>
>     Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
>
> made their way into 5.3-rc5
>
> We (mostly Linus, Andrea, and myself) have been discussing offlist how to
> implement a sane default allocation strategy for hugepages on NUMA
> platforms.
>
> With these reverts in place, the page allocator will happily allocate a
> remote hugepage immediately rather than try to make a local hugepage
> available.  This incurs a substantial performance degradation when
> memory compaction would have otherwise made a local hugepage available.
>
> This series reverts those reverts and attempts to propose a more sane
> default allocation strategy specifically for hugepages.  Andrea
> acknowledges this is likely to fix the swap storms that he originally
> reported that resulted in the patches that removed __GFP_THISNODE from
> hugepage allocations.

I sent a single comment about this patch in the private email thread you
started, and I quote my answer below for full disclosure:

============

> This is an admittedly hacky solution that shouldn't cause anybody to
> regress based on NUMA and the semantics of MADV_HUGEPAGE for the past
> 4 1/2 years for users whose workload does fit within a socket.

How can you live with the below if you can't live with 5.3-rc6? Here
you allocate remote THP if the local THP allocation fails.

> 			page = __alloc_pages_node(hpage_node,
> 						gfp | __GFP_THISNODE, order);
> +
> +			/*
> +			 * If hugepage allocations are configured to always
> +			 * synchronous compact or the vma has been madvised
> +			 * to prefer hugepage backing, retry allowing remote
> +			 * memory as well.
> +			 */
> +			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
> +				page = __alloc_pages_node(hpage_node,
> +						gfp | __GFP_NORETRY, order);
> +

You're still going to get THP allocated remote _before_ you have a
chance to allocate 4k local this way.  __GFP_NORETRY won't make any
difference when there's THP immediately available in the remote nodes.

My suggestion is to stop touching and tweaking the gfp_mask second
parameter, and I suggest you to tweak the third parameter instead.  If
you set the order=(1<<0)|(1<<9) (instead of current order = 9), only then
we'll move in the right direction and we'll get something that can work
better than 5.3-rc6 no matter which workload you throw at it.

> +	if (order >= pageblock_order && (gfp_mask & __GFP_IO)) {
> +		/*
> +		 * If allocating entire pageblock(s) and compaction
> +		 * failed because all zones are below low watermarks
> +		 * or is prohibited because it recently failed at this
> +		 * order, fail immediately.
> +		 *
> +		 * Reclaim is
> +		 *  - potentially very expensive because zones are far
> +		 *    below their low watermarks or this is part of very
> +		 *    bursty high order allocations,
> +		 *  - not guaranteed to help because isolate_freepages()
> +		 *    may not iterate over freed pages as part of its
> +		 *    linear scan, and
> +		 *  - unlikely to make entire pageblocks free on its
> +		 *    own.
> +		 */
> +		if (compact_result == COMPACT_SKIPPED ||
> +		    compact_result == COMPACT_DEFERRED)
> +			goto nopage;
> +	}
> +

Overall this patch would solve the swap storms and it's similar to
__GFP_COMPACTONLY, but it also allows allocating THP in the remote nodes
if remote THP are immediately available in the buddy (despite there may
also be local 4k memory immediately available in the buddy).

So it's better than just __GFP_COMPACTONLY, but it still cripples
compaction a bit and it won't really work a whole lot different than
5.3-rc6 in terms of prioritizing local 4k over remote THP.

So it's not clear why you can't live with 5.3-rc6 if you can live with
the above that will provide you no more guarantees than 5.3-rc6 to get
local 4k before remote THP.

Thanks,
Andrea

==============

> The immediate goal is to return 5.3 to the behavior the kernel has
> implemented over the past several years so that remote hugepages are
> not immediately allocated when local hugepages could have been made
> available because the increased access latency is untenable.

Remote hugepages are immediately allocated before local 4k pages with
that patch you sent.

> +			page = __alloc_pages_node(hpage_node,
> +						gfp | __GFP_NORETRY, order);

This doesn't prevent allocating remote THP if they are immediately
available.

> Merging these reverts late in the rc cycle to change behavior that has
> existed for years and is known (and acknowledged) to create performance
> degradations when local hugepages could have been made available serves
> no purpose other than to make the development of a sane default policy
> much more difficult under time pressure and to accelerate decisions that
> will affect how userspace is written (and how it has regressed) that
> otherwise require carefully crafted and detailed implementations.

Your change is calling alloc_pages first with __GFP_THISNODE in a
__GFP_COMPACTONLY way, and then it falls back immediately to allocate 2M
on all nodes, including the local node for a second time, before
allocating local 4k (with a magical __GFP_NORETRY which won't make a
difference anyway if the 2m pages are immediately available).

> Thus, this patch series returns 5.3 to the long-standing allocation
> strategy that Linux has had for years and proposes to follow-up changes
> that can be discussed that Andrea acknowledges will avoid the swap storms
> that initially triggered this discussion in the first place.

I said one good thing about this patch series, that it fixes the swap
storms.  But upstream 5.3 fixes the swap storms too, and what you sent
is not nearly equivalent to the mempolicy that Michal was willing to
provide you and that we thought you needed to get bigger guarantees of
getting only local 2m or local 4k pages.

In fact we could weaken the aggressiveness of the proposed mempolicy by
an order of magnitude if you can live with the above and the above
performs well for you.

If you could post an open source reproducer of your proprietary binary
that got regressed by 5.3, we could help finding a solution.  Instead
it's not clear how you can live with this patch series, but not with 5.3,
and also why you insist on not making this new allocation policy an
opt-in mempolicy.

Furthermore no generic benchmark has been run on this series to be sure
it won't regress performance for common workloads.  So it's certainly
more risky than the current 5.3 status, which matches what's running in
production on most enterprise distros.
I thought you clarified that the page fault latency was not the primary
reason you had __GFP_THISNODE there.

If instead you've still got a latency issue in the 2M page fault and that
was the real reason for __GFP_THISNODE, why don't you lower the latency of
compaction in remote nodes without having to call alloc_pages 3 times?  I
mean, I proposed to call alloc_pages just once instead of the current
twice; with your patch we'd be calling alloc_pages three times, with a
repetition attempt even on the 2m page size on the local node.

Obviously this "hacky solution" is better than 5.3-rc4 and all previous
kernels because it at least doesn't create swap storms.  What's not clear
is how this is going to work better than upstream for you.  What I think
will happen is that this will work similarly to __GFP_COMPACTONLY: it'll
weaken the THP utilization ratio for MADV_HUGEPAGE users, and it's not
tested to be sure it won't perform worse in other conditions.

Thanks,
Andrea
On Wed, 4 Sep 2019, Linus Torvalds wrote:
> > This series reverts those reverts and attempts to propose a more sane
> > default allocation strategy specifically for hugepages. Andrea
> > acknowledges this is likely to fix the swap storms that he originally
> > reported that resulted in the patches that removed __GFP_THISNODE from
> > hugepage allocations.
>
> There's no way we can try this for 5.3 even if it looks ok. This is
> "let's try this during the 5.4 merge window" material, and see how it
> works.
>
> But I'd love affected people to test this all on their loads and post
> numbers, so that we have actual numbers for this series when we do try
> to merge it.
>

I'm certainly not proposing the last two patches in the series marked as
RFC to be merged.  I'm proposing that the first two patches in the series,
reverts of the reverts that went into 5.3-rc5, are merged for 5.3 so that
we return to the same behavior that we have had for years and the
semantics that MADV_HUGEPAGE has provided that entire libraries and
userspaces have been based on.

It is very clear that there is a path forward here to address the *bug*
that Andrea is reporting: it has become conflated with NUMA allocation
policies which is not at all the issue.  Note that if 5.3 is released with
these patches, it requires a very specialized usecase to benefit from:
workloads that are larger than one socket and *require* remote memory not
being low on memory or fragmented.  If remote memory is as low on memory
or fragmented as local memory (like in a datacenter), the reverts that
went into 5.3 will double the impact of the very bug being reported
because now it's causing swap storms for remote memory as well.  I don't
anticipate we'll get numbers for that since it's not a configuration they
run in.

The bug here is reclaim in the page allocator that does not benefit memory
compaction because we are failing per-zone watermarks already.  The last
two patches in this series avoid that, which is a sane default page
allocation policy, and allow fallback to remote memory only when we can't
easily allocate locally.

We *need* the ability to allocate hugepages locally if compaction can
work; anything else kills performance.  5.3-rc7 won't try that, it will
simply fall back to remote memory.  We need to try compaction but we do
not want to reclaim if failing watermark checks.

I hope that I'm not being unrealistically optimistic that we can make
progress on providing a sane default allocation policy using those last
two patches as a starter for 5.4, but I'm strongly suggesting that you
take the first two patches to return us to the policy that has existed for
years and not allow MADV_HUGEPAGE to be used for immediate remote
allocation when local is possible.
On Wed, 4 Sep 2019, Andrea Arcangeli wrote:
> > This is an admittedly hacky solution that shouldn't cause anybody to
> > regress based on NUMA and the semantics of MADV_HUGEPAGE for the past
> > 4 1/2 years for users whose workload does fit within a socket.
>
> How can you live with the below if you can't live with 5.3-rc6? Here
> you allocate remote THP if the local THP allocation fails.
>
> > 			page = __alloc_pages_node(hpage_node,
> > 						gfp | __GFP_THISNODE, order);
> > +
> > +			/*
> > +			 * If hugepage allocations are configured to always
> > +			 * synchronous compact or the vma has been madvised
> > +			 * to prefer hugepage backing, retry allowing remote
> > +			 * memory as well.
> > +			 */
> > +			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
> > +				page = __alloc_pages_node(hpage_node,
> > +						gfp | __GFP_NORETRY, order);
> > +
>
> You're still going to get THP allocate remote _before_ you have a
> chance to allocate 4k local this way. __GFP_NORETRY won't make any
> difference when there's THP immediately available in the remote nodes.
>

This is incorrect: the fallback allocation here is only if the initial
allocation with __GFP_THISNODE fails.  In that case, we were able to
compact memory to make a local hugepage available without incurring
excessive swap based on the RFC patch that appears as patch 3 in this
series.  I very much believe your usecase would benefit from this as well
(or at least not cause others to regress).  We *want* remote thp if they
are immediately available, but only after we have tried to allocate
locally from the initial allocation and allowed memory compaction to fail
first.

Likely there can be discussion around the fourth patch of this series to
get exactly the right policy.  We can construct it as necessary for
hugetlbfs to not have any change in behavior, that's simple.  We could
also check per-zone watermarks in mm/huge_memory.c to determine if local
memory is low-on-memory and, if so, allow remote allocation.  In that case
it's certainly better to allocate remotely when we'd be reclaiming locally
even for fallback native pages.

> I said one good thing about this patch series, that it fixes the swap
> storms. But upstream 5.3 fixes the swap storms too and what you sent
> is not nearly equivalent to the mempolicy that Michal was willing
> to provide you and that we thought you needed to get bigger guarantees
> of getting only local 2m or local 4k pages.
>

I haven't seen such a patch series, is there a link?
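For illustration only, the per-zone watermark check floated above ("check
per-zone watermarks in mm/huge_memory.c to determine if local memory is
low-on-memory") could look roughly like the sketch below.  This is not
code from the series; the helper name, the order used, and where it would
be called from are assumptions:

static bool node_below_hugepage_watermark(int nid)
{
	pg_data_t *pgdat = NODE_DATA(nid);
	int i;

	for (i = 0; i < MAX_NR_ZONES; i++) {
		struct zone *zone = &pgdat->node_zones[i];

		if (!populated_zone(zone))
			continue;
		/* Can any local zone still satisfy an order-9 page? */
		if (zone_watermark_ok(zone, HPAGE_PMD_ORDER,
				      low_wmark_pages(zone),
				      zone_idx(zone), 0))
			return false;
	}
	/* No local zone passes its watermark: allow remote fallback. */
	return true;
}

Under this sketch, if no local zone can pass its low watermark for a huge
page, falling back remotely is arguably better than reclaiming locally,
which matches the "allocate remotely when we'd be reclaiming locally"
remark above.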
Is there any objection from anybody to applying the first two patches, the reverts of the reverts that went into 5.3-rc5, for 5.3 and pursuing discussion and development using the last two patches in this series as a starting point for a sane allocation policy that just works by default for everybody? Andrea acknowledges the swap storm that he reported would be fixed with the last two patches in this series and there has been discussion on how they can be extended at the same time that they do not impact allocations outside the scope of the discussion here (hugetlb). On Thu, 5 Sep 2019, David Rientjes wrote: > On Wed, 4 Sep 2019, Linus Torvalds wrote: > > > > This series reverts those reverts and attempts to propose a more sane > > > default allocation strategy specifically for hugepages. Andrea > > > acknowledges this is likely to fix the swap storms that he originally > > > reported that resulted in the patches that removed __GFP_THISNODE from > > > hugepage allocations. > > > > There's no way we can try this for 5.3 even if looks ok. This is > > "let's try this during the 5.4 merge window" material, and see how it > > works. > > > > But I'd love affected people to test this all on their loads and post > > numbers, so that we have actual numbers for this series when we do try > > to merge it. > > > > I'm certainly not proposing the last two patches in the series marked as > RFC to be merged. I'm proposing the first two patches in the series, > reverts of the reverts that went into 5.3-rc5, are merged for 5.3 so that > we return to the same behavior that we have had for years and semantics > that MADV_HUGEPAGE has provided that entire libraries and userspaces have > been based on. > > It is very clear that there is a path forward here to address the *bug* > that Andrea is reporting: it has become conflated with NUMA allocation > policies which is not at all the issue. Note that if 5.3 is released with > these patches that it requires a very specialized usecase to benefit from: > workloads that are larger than one socket and *requires* remote memory not > being low on memory or fragmented. If remote memory is as low on memory > or fragmented as local memory (like in a datacenter), the reverts that > went into 5.3 will double the impact of the very bug being reported > because now it's causing swap storms for remote memory as well. I don't > anticipate we'll get numbers for that since it's not a configuration they > run in. > > The bug here is reclaim in the page allocator that does not benefit memory > compaction because we are failing per-zone watermarks already. The last > two patches in these series avoid that, which is a sane default page > allocation policy, and the allow fallback to remote memory only when we > can't easily allocate locally. > > We *need* the ability to allocate hugepages locally if compaction can > work, anything else kills performance. 5.3-rc7 won't try that, it will > simply fallback to remote memory. We need to try compaction but we do not > want to reclaim if failing watermark checks. > > I hope that I'm not being unrealistically optimistic that we can make > progress on providing a sane default allocation policy using those last > two patches as a starter for 5.4, but I'm strongly suggesting that you > take the first two patches to return us to the policy that has existed for > years and not allow MADV_HUGEPAGE to be used for immediate remote > allocation when local is possible. >
On Sat, Sep 7, 2019 at 12:51 PM David Rientjes <rientjes@google.com> wrote: > > Andrea acknowledges the swap storm that he reported would be fixed with > the last two patches in this series The problem is that even you aren't arguing that those patches should go into 5.3. So those fixes aren't going in, so "the swap storms would be fixed" argument isn't actually an argument at all as far as 5.3 is concerned. End result: we'd have the qemu-kvm instance performance problem in 5.3 that apparently causes distros to apply those patches that you want to revert anyway. So reverting would just make distros not use 5.3 in that form. So I don't think we can revert those again without applying the two patches in the series. So it would be a 5.4 merge window operation. Linus
On Sat, 7 Sep 2019, Linus Torvalds wrote:
> > Andrea acknowledges the swap storm that he reported would be fixed with
> > the last two patches in this series
>
> The problem is that even you aren't arguing that those patches should
> go into 5.3.
>

For three reasons: (a) we lack a test result from Andrea, (b) there's
on-going discussion, particularly based on Vlastimil's feedback, and
(c) the patches will be refreshed incorporating that feedback as well as
Mike's suggestion to exempt __GFP_RETRY_MAYFAIL for hugetlb.

> So those fixes aren't going in, so "the swap storms would be fixed"
> argument isn't actually an argument at all as far as 5.3 is concerned.
>

It indicates that progress has been made to address the actual bug without
introducing long-lived access latency regressions for others, particularly
those who use MADV_HUGEPAGE.  In the worst case, some systems running
5.3-rc4 and 5.3-rc5 have the same amount of memory backed by hugepages but
on 5.3-rc5 the vast majority of it is allocated remotely.  This incurs a
significant performance regression regardless of platform; the only thing
needed to induce this is a fragmented local node that would otherwise be
compacted on 5.3-rc4 rather than quickly allocating remotely on 5.3-rc5.

> End result: we'd have the qemu-kvm instance performance problem in 5.3
> that apparently causes distros to apply those patches that you want to
> revert anyway.
>
> So reverting would just make distros not use 5.3 in that form.
>

I'm arguing to revert 5.3 back to the behavior that we have had for years
and actually fix the bug that everybody else seems to be ignoring and then
*backport* those fixes to 5.3 stable and every other stable tree that can
use them.  Introducing a new mempolicy for NUMA locality into 5.3.0 that
will subsequently be changed in future 5.3 stable kernels and differs from
all kernels from the past few years is not in anybody's best interest if
the actual problem can be fixed.  It requires more feedback than a
one-line "the swap storms would be fixed with this."  That collaboration
takes time and isn't something that should be rushed into 5.3-rc5.

Yes, we can fix NUMA locality of hugepages when a workload like qemu is
larger than a single socket; the vast majority of workloads in the
datacenter are smaller than a socket and *cannot* incur the performance
penalty if local memory is fragmented that 5.3-rc5 introduces.

In other words, 5.3-rc5 is only fixing a highly specialized usecase where
remote allocation is acceptable because the workload is larger than a
socket *and* remote memory is not low on memory or fragmented.  If you
consider the opposite of that, workloads smaller than a socket or where
local compaction actually works, this has introduced a measurable
regression for everybody else.

I'm not sure why we are ignoring a painfully obvious bug in the page
allocator because of a poor feedback loop between itself and memory
compaction and rather papering over it by falling back to remote memory
when NUMA actually does matter.  If you release 5.3 without the first two
patches in this series, I wouldn't expect any additional feedback or test
results to fix this bug considering all we have gotten so far is "this
would fix the swap storms" and not collaborating to fix the issue for
everybody rather than only caring about their own workloads.  At least my
patches acknowledge and try to fix the issue the other is encountering.
On 9/8/19 3:50 AM, David Rientjes wrote: > On Sat, 7 Sep 2019, Linus Torvalds wrote: > >>> Andrea acknowledges the swap storm that he reported would be fixed with >>> the last two patches in this series >> >> The problem is that even you aren't arguing that those patches should >> go into 5.3. >> > > For three reasons: (a) we lack a test result from Andrea, That's argument against the rfc patches 3+4s, no? But not for including the reverts of reverts of reverts (patches 1+2). > (b) there's > on-going discussion, particularly based on Vlastimil's feedback, and I doubt this will be finished and tested with reasonable confidence even for the 5.4 merge window. > (c) the patches will be refreshed incorporating that feedback as well as > Mike's suggestion to exempt __GFP_RETRY_MAYFAIL for hugetlb. There might be other unexpected consequences (even if hugetlb wasn't such an issue as I suspected, in the end). >> So those fixes aren't going in, so "the swap storms would be fixed" >> argument isn't actually an argument at all as far as 5.3 is concerned. >> > > It indicates that progress has been made to address the actual bug without > introducing long-lived access latency regressions for others, particularly > those who use MADV_HUGEPAGE. In the worst case, some systems running > 5.3-rc4 and 5.3-rc5 have the same amount of memory backed by hugepages but > on 5.3-rc5 the vast majority of it is allocated remotely. This incurs a It's been said before, but such sensitive code generally relies on mempolicies or node reclaim mode, not THP __GFP_THISNODE implementation details. Or if you know there's enough free memory and just needs to be compacted, you could do it once via sysfs before starting up your workload. > signficant performance regression regardless of platform; the only thing > needed to induce this is a fragmented local node that would otherwise be > compacted in 5.3-rc4 rather than quickly allocate remote on 5.3-rc5. > >> End result: we'd have the qemu-kvm instance performance problem in 5.3 >> that apparently causes distros to apply those patches that you want to >> revert anyway. >> >> So reverting would just make distros not use 5.3 in that form. >> > > I'm arguing to revert 5.3 back to the behavior that we have had for years > and actually fix the bug that everybody else seems to be ignoring and then > *backport* those fixes to 5.3 stable and every other stable tree that can > use them. Introducing a new mempolicy for NUMA locality into 5.3.0 that I think it's rather removing the problematic implicit mempolicy of __GFP_THISNODE. > will subsequently changed in future 5.3 stable kernels and differs from > all kernels from the past few years is not in anybody's best interest if > the actual problem can be fixed. It requires more feedback than a > one-line "the swap storms would be fixed with this." That collaboration > takes time and isn't something that should be rushed into 5.3-rc5. > > Yes, we can fix NUMA locality of hugepages when a workload like qemu is > larger than a single socket; the vast majority of workloads in the > datacenter are small than a socket and *cannot* incur the performance > penalty if local memory is fragmented that 5.3-rc5 introduces. > > In other words, 5.3-rc5 is only fixing a highly specialized usecase where > remote allocation is acceptable because the workload is larger than a > socket *and* remote memory is not low on memory or fragmented. 
If you Clearly we disagree here which is the highly specialized usecase that might get slower remote memory access, and which is more common workload that will suffer from swap storms. No point arguing it further, but several distros made the choice by carrying Andrea's patches already. > consider the opposite of that, workloads smaller than a socket or local > compaction actually works, this has introduced a measurable regression for > everybody else. > > I'm not sure why we are ignoring a painfully obvious bug in the page > allocator because of a poor feedback loop between itself and memory > compaction and rather papering over it by falling back to remote memory > when NUMA actually does matter. If you release 5.3 without the first two > patches in this series, I wouldn't expect any additional feedback or test > results to fix this bug considering all we have gotten so far is "this > would fix this swap storms" and not collaborating to fix the issue for > everybody rather than only caring about their own workloads. At least my > patches acknowledge and try to fix the issue the other is encountering. I might have missed something, but you were asked for a reproducer of your use case so others can develop patches with it in mind? Mel did provide a simple example that shows the swap storms very easily.
On Sun, 8 Sep 2019, Vlastimil Babka wrote: > > On Sat, 7 Sep 2019, Linus Torvalds wrote: > > > >>> Andrea acknowledges the swap storm that he reported would be fixed with > >>> the last two patches in this series > >> > >> The problem is that even you aren't arguing that those patches should > >> go into 5.3. > >> > > > > For three reasons: (a) we lack a test result from Andrea, > > That's argument against the rfc patches 3+4s, no? But not for including > the reverts of reverts of reverts (patches 1+2). > Yes, thanks: I would strongly prefer not to propose rfc patches 3-4 without a testing result from Andrea and collaboration to fix the underlying issue. My suggestion to Linus is to merge patches 1-2 so we don't have additional semantics for MADV_HUGEPAGE or thp enabled=always configs based on kernel version, especially since they are already conflated. > > (b) there's > > on-going discussion, particularly based on Vlastimil's feedback, and > > I doubt this will be finished and tested with reasonable confidence even > for the 5.4 merge window. > Depends, but I probably suspect the same. If the reverts to 5.3 are not applied, then I'm not at all confident that forward progress on this issue will be made: my suggestion about changes to the page allocator when the patches were initially proposed went unresponded to, as did the ping on those suggestions, and now we have a simplistic "this will fix the swap storms" but no active involvement from Andrea to improve this; he likely is quite content on lumping NUMA policy onto an already overloaded madvise mode. [ NOTE! The rest of this email and my responses are about how to address the default page allocation behavior which we can continue to discuss but I'd prefer it separated from the discussion of reverts for 5.3 which needs to be done to not conflate madvise modes with mempolicies for a subset of kernel versions. ] > > It indicates that progress has been made to address the actual bug without > > introducing long-lived access latency regressions for others, particularly > > those who use MADV_HUGEPAGE. In the worst case, some systems running > > 5.3-rc4 and 5.3-rc5 have the same amount of memory backed by hugepages but > > on 5.3-rc5 the vast majority of it is allocated remotely. This incurs a > > It's been said before, but such sensitive code generally relies on > mempolicies or node reclaim mode, not THP __GFP_THISNODE implementation > details. Or if you know there's enough free memory and just needs to be > compacted, you could do it once via sysfs before starting up your workload. > This entire discussion is based on the long standing and default behavior of page allocation for transparent hugepages. Your suggestions are not possible for two reasons: (1) I cannot enforce a mempolicy of MPOL_BIND because this doesn't allow fallback at all and would oom kill if the local node is oom, and (2) node reclaim mode is a system-wide setting so all workloads are affected for every page allocation, not only users of MADV_HUGEPAGE who specifically opt-in to expensive allocation. 
We could make the argument that Andrea's qemu usecase could simply use
MPOL_PREFERRED for memory that should be faulted remotely, which would
provide more control and would work for all versions of Linux regardless
of MADV_HUGEPAGE or not; that's a much simpler workaround than conflating
MADV_HUGEPAGE with NUMA locality, asking users who are adversely affected
by 5.3 to create new mempolicies to work around something that has always
worked fine, or asking users to tune page allocator policies with
sysctls.

> > I'm arguing to revert 5.3 back to the behavior that we have had for years
> > and actually fix the bug that everybody else seems to be ignoring and then
> > *backport* those fixes to 5.3 stable and every other stable tree that can
> > use them. Introducing a new mempolicy for NUMA locality into 5.3.0 that
>
> I think it's rather removing the problematic implicit mempolicy of
> __GFP_THISNODE.
>

I'm referring to a solution that is backwards compatible for existing
users, which 5.3 is certainly not.

> I might have missed something, but you were asked for a reproducer of
> your use case so others can develop patches with it in mind? Mel did
> provide a simple example that shows the swap storms very easily.
>

Are you asking for a synthetic kernel module that you can inject to induce
fragmentation on a local node where memory compaction would be possible
and then a userspace program that uses MADV_HUGEPAGE and fits within that
node?  The regression I'm reporting is for workloads that fit within a
socket; it requires local fragmentation to show a regression.  For the
qemu case, it's quite easy to fill a local node and require additional
hugepage allocations with MADV_HUGEPAGE in a test case, but without
synthetically inducing fragmentation I cannot provide a testcase that will
show a performance regression because memory is quickly faulted remotely
rather than compacting locally.
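As a point of reference for the MPOL_PREFERRED workaround mentioned above,
here is a minimal userspace sketch of how a workload like qemu could say
"prefer this node, but allow fallback" on a specific mapping via mbind(2).
It is illustrative only (the node number and size are made up, and it is
not something proposed in this thread); build with -lnuma:

#define _GNU_SOURCE
#include <numaif.h>		/* mbind(), MPOL_PREFERRED */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
	size_t len = 512UL << 20;		/* 512M, arbitrary */
	unsigned long nodemask = 1UL << 1;	/* prefer node 1, arbitrary */

	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * MPOL_PREFERRED: allocate from node 1 when possible, but unlike
	 * MPOL_BIND it falls back to other nodes instead of OOM-killing
	 * when the preferred node is low on memory.
	 */
	if (mbind(p, len, MPOL_PREFERRED, &nodemask,
		  sizeof(nodemask) * 8, 0)) {
		perror("mbind");
		return 1;
	}

	madvise(p, len, MADV_HUGEPAGE);		/* opt in to THP backing */

	for (size_t off = 0; off < len; off += 4096)
		p[off] = 1;			/* fault the pages in */

	return 0;
}

Whether this is an acceptable answer is exactly what is disputed in this
thread: the argument above is that existing MADV_HUGEPAGE users should not
have to add mempolicies to keep the behavior they already had.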
On Sun 08-09-19 13:45:13, David Rientjes wrote: > If the reverts to 5.3 are not > applied, then I'm not at all confident that forward progress on this issue > will be made: David, could you stop this finally? I think there is a good consensus that the current (even after reverts) behavior is not going all the way down where we want to get. There have been different ways forward suggested to not fallback to remote nodes too easily, not to mention a specialized memory policy to explicitly request the behavior you presumably need (and as a bonus it wouldn't be THP specific which is even better). You seem to be deadlocked in "we've used to do something for 4 years so we must preserve that behavior". All that based on a single and odd workload which you are hand waving about without anything for the rest of the community to reproduce. Please try to get out of the argumentation loop. We are more likely to make a forward progress. 5.3 managed to fix the worst case behavior, now let's talk about more clever tuning. You cannot expect such a tuning is an overnight work. This area is full of subtle side effects and few liners might have hard to predict consequences.
On Thu 05-09-19 14:06:28, David Rientjes wrote:
> On Wed, 4 Sep 2019, Andrea Arcangeli wrote:
> > > This is an admittedly hacky solution that shouldn't cause anybody to
> > > regress based on NUMA and the semantics of MADV_HUGEPAGE for the past
> > > 4 1/2 years for users whose workload does fit within a socket.
> >
> > How can you live with the below if you can't live with 5.3-rc6? Here
> > you allocate remote THP if the local THP allocation fails.
> >
> > > 			page = __alloc_pages_node(hpage_node,
> > > 						gfp | __GFP_THISNODE, order);
> > > +
> > > +			/*
> > > +			 * If hugepage allocations are configured to always
> > > +			 * synchronous compact or the vma has been madvised
> > > +			 * to prefer hugepage backing, retry allowing remote
> > > +			 * memory as well.
> > > +			 */
> > > +			if (!page && (gfp & __GFP_DIRECT_RECLAIM))
> > > +				page = __alloc_pages_node(hpage_node,
> > > +						gfp | __GFP_NORETRY, order);
> > > +
> >
> > You're still going to get THP allocate remote _before_ you have a
> > chance to allocate 4k local this way. __GFP_NORETRY won't make any
> > difference when there's THP immediately available in the remote nodes.
> >
>
> This is incorrect: the fallback allocation here is only if the initial
> allocation with __GFP_THISNODE fails. In that case, we were able to
> compact memory to make a local hugepage available without incurring
> excessive swap based on the RFC patch that appears as patch 3 in this
> series.

That patch is quite obscure and specific to pageblock_order+ sizes and for
some reason requires __GFP_IO without any explanation of why.  The problem
is not THP specific, right?  Any other high order has the same problem
AFAICS.  So it is just a hack and that's why it is hard to reason about.

I believe it would be best to start by explaining why we do not see the
same problem with order-0 requests.  We do not enter the slow path and
thus the memory reclaim if there is any other node to pass the watermark
as well, right?  So essentially we are relying on kswapd to keep nodes
balanced so that allocation requests can be satisfied from a local node.
We do have kcompactd to do background compaction.  Why do we want to rely
on the direct compaction instead?  What is the fundamental difference?

Your changelog goes on at length about some problems in the compaction but
I really do not see the underlying problem description.  We cannot do any
sensible fix/heuristic without capturing that IMHO.  Either there is some
fundamental difference between direct and background compaction and doing
the former is necessary and we should be doing that by default for all
higher order requests that are sleepable (aka __GFP_DIRECT_RECLAIM), or
there is something to fix for the background compaction to be more
pro-active.

> > I said one good thing about this patch series, that it fixes the swap
> > storms. But upstream 5.3 fixes the swap storms too and what you sent
> > is not nearly equivalent to the mempolicy that Michal was willing
> > to provide you and that we thought you needed to get bigger guarantees
> > of getting only local 2m or local 4k pages.
> >
>
> I haven't seen such a patch series, is there a link?

Not yet, unfortunately.  So far I haven't heard that you are even
interested in that policy.  You have never commented on that IIRC.
Let me revive this thread as there was no follow up.

On Mon 09-09-19 21:30:20, Michal Hocko wrote:
[...]
> I believe it would be best to start by explaining why we do not see the
> same problem with order-0 requests. We do not enter the slow path and
> thus the memory reclaim if there is any other node to pass the watermark
> as well, right? So essentially we are relying on kswapd to keep nodes
> balanced so that allocation requests can be satisfied from a local node.
> We do have kcompactd to do background compaction. Why do we want to rely
> on the direct compaction instead? What is the fundamental difference?

I am especially interested in this part.  The more I think about this the
more I am convinced that the underlying problem really is in the premature
fallback in the fast path.

Does the almost-patch below help your workload?  It effectively reduces
the fast path for higher order allocations to the local/requested node.
The justification is that the watermark check might be too strict for
those requests as it is primarily order-0 oriented.  The low watermark
target simply has no meaning for the higher order requests AFAIU.  The
min-low gap is giving kswapd a chance to balance and be more local node
friendly while we do not have anything like that in compaction.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ff5484fdbdf9..09036cf55fca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4685,7 +4685,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 {
 	struct page *page;
 	unsigned int alloc_flags = ALLOC_WMARK_LOW;
-	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
+	gfp_t fastpath_mask, alloc_mask; /* The gfp_t that was actually used for allocation */
 	struct alloc_context ac = { };
 
 	/*
@@ -4698,7 +4698,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 	}
 
 	gfp_mask &= gfp_allowed_mask;
-	alloc_mask = gfp_mask;
+	fastpath_mask = alloc_mask = gfp_mask;
 	if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
 		return NULL;
 
@@ -4710,8 +4710,17 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 	 */
 	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp_mask);
 
-	/* First allocation attempt */
-	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
+	/*
+	 * First allocation attempt. If we have a high order allocation then do not fall
+	 * back to a remote node just based on the watermark check on the requested node
+	 * because compaction might easily free up a requested order and then it would be
+	 * better to simply go to the slow path.
+	 * TODO: kcompactd should help here but nobody has woken it up unless we hit the
+	 * slow path so we might need some tuning there as well.
+	 */
+	if (order && (gfp_mask & __GFP_DIRECT_RECLAIM))
+		fastpath_mask |= __GFP_THISNODE;
+	page = get_page_from_freelist(fastpath_mask, order, alloc_flags, &ac);
 	if (likely(page))
 		goto out;
On Wed, 25 Sep 2019, Michal Hocko wrote: > I am especially interested about this part. The more I think about this > the more I am convinced that the underlying problem really is in the pre > mature fallback in the fast path. I appreciate you taking the time to continue to look at this but I'm confused about the underlying problem you're referring to: we had no underlying problem until 5.3 was released so we need to carry patches that will revert this behavior (we simply can't tolerate double digit memory access latency regressions lol). If you're referring to post-5.3 behavior, this appears to override alloc_hugepage_direct_gfpmask() but directly in the page allocator. static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr) { ... /* * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not * specified, to express a general desire to stay on the current Your patch is setting __GFP_THISNODE for __GFP_DIRECT_RECLAIM: this allocation will fail in the fastpath for both my case (fragmented local node) and Andrea's case (out of memory local node). The first get_page_from_freelist() will then succeed in the slowpath for both cases; compaction is not tried for either. In my case, that results in a perpetual remote access latency that we can't tolerate. If Andrea's remote nodes are fragmented or low on memory, his case encounters swap storms over both the local node and remote nodes. So I'm not really sure what is solved by your patch? We're not on 5.3, but I can try it and collect data on exactly how poorly it performs on fragmented *hosts* (not single nodes, but really the whole system) because swap storms were never fixed here, it was only papered over. > Does the almost-patch below helps your > workload? It effectively reduces the fast path for higher order > allocations to the local/requested node. The justification is that > watermark check might be too strict for those requests as it is primary > order-0 oriented. Low watermark target simply has no meaning for the > higher order requests AFAIU. The min-low gap is giving kswapd a chance > to balance and be more local node friendly while we do not have anything > like that in compaction. > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index ff5484fdbdf9..09036cf55fca 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4685,7 +4685,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, > { > struct page *page; > unsigned int alloc_flags = ALLOC_WMARK_LOW; > - gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */ > + gfp_t fastpath_mask, alloc_mask; /* The gfp_t that was actually used for allocation */ > struct alloc_context ac = { }; > > /* > @@ -4698,7 +4698,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, > } > > gfp_mask &= gfp_allowed_mask; > - alloc_mask = gfp_mask; > + fastpath_mask = alloc_mask = gfp_mask; > if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags)) > return NULL; > > @@ -4710,8 +4710,17 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, > */ > alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp_mask); > > - /* First allocation attempt */ > - page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac); > + /* > + * First allocation attempt. 
If we have a high order allocation then do not fall > + * back to a remote node just based on the watermark check on the requested node > + * because compaction might easily free up a requested order and then it would be > + * better to simply go to the slow path. > + * TODO: kcompactd should help here but nobody has woken it up unless we hit the > + * slow path so we might need some tuning there as well. > + */ > + if (order && (gfp_mask & __GFP_DIRECT_RECLAIM)) > + fastpath_mask |= __GFP_THISNODE; > + page = get_page_from_freelist(fastpath_mask, order, alloc_flags, &ac); > if (likely(page)) > goto out;
On Thu 26-09-19 12:03:37, David Rientjes wrote:
[...]
> Your patch is setting __GFP_THISNODE for __GFP_DIRECT_RECLAIM: this
> allocation will fail in the fastpath for both my case (fragmented local
> node) and Andrea's case (out of memory local node). The first
> get_page_from_freelist() will then succeed in the slowpath for both cases;
> compaction is not tried for either.
>
> In my case, that results in a perpetual remote access latency that we
> can't tolerate. If Andrea's remote nodes are fragmented or low on memory,
> his case encounters swap storms over both the local node and remote nodes.
>
> So I'm not really sure what is solved by your patch?

There are two aspects the patch is targeting.  The first is that the fast
path is targeting a higher watermark (WMARK_LOW), so it might fall back to
a remote node more easily, and then the fast path doesn't wake up
kcompactd, so there is no pro-active compaction going on to help future
allocations.

You are right that a fragmented or at-min-watermark node would fall back
to a remote node even with this patch.  I wanted to see how much kcompactd
can change the overall picture.  If this is not sufficient then maybe we
need to drop the first optimistic attempt as well and simply go right into
the light compaction.  Something like this on top:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ff5484fdbdf9..61284e7f01ee 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4434,7 +4434,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * The adjusted alloc_flags might result in immediate success, so try
 	 * that first
 	 */
-	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+	if (!order)
+		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
 	if (page)
 		goto got_pg;

The whole point of handling this in the page allocator directly is to have
a unified solution rather than have each specific caller invent its own
way to achieve higher locality.
On Fri, Sep 27, 2019 at 12:48 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> -	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
> +	if (!order)
> +		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
>  	if (page)
>  		goto got_pg;
>
> The whole point of handling this in the page allocator directly is to
> have a unified solution rather than have each specific caller invent
> its own way to achieve higher locality.

The above just looks hacky.  Why would order-0 be special?  It really is
hugepages that are special - small orders don't generally need a lot of
compaction.  And this secondary get_page_from_freelist() is not primarily
about compaction, it's about relaxing some of the other constraints, and
the actual alloc_flags have changed from the initial ones.

I really think that you're missing the big picture.  We want node
locality, and we want big pages, but those "we want" simply shouldn't be
as black-and-white as they sometimes are today.  In fact, the whole
black-and-white thing is completely crazy for anything but some test-load.

I will just do the reverts and apply David's two patches on top.  We now
have the time to actually test the behavior, and we're not just before a
release.

The thing is, David has numbers, and the patches make _sense_.  There's a
description of what they do, there are comments, but the *code* makes
sense too.  Much more sense than the above kind of insane hack.

David's patches literally do two things:

 - [3/4] just admit that the allocation failed when you're trying to
   allocate a huge-page and compaction wasn't successful

 - [4/4] when that huge-page allocation failed, retry on another node
   when appropriate

That's _literally_ what David's two patches do.  The above is purely the
English translation of the patches.  And I claim that the English
translation also ends up being _sensible_.  I go look at those two
statements, and my reaction is "yeah, that makes sense".

So there were numbers, there was "this is sensible", and there were no
indications that those sensible choices would actually be problematic.
Nobody has actually argued against the patches making sense.  Nobody has
even argued that the patches would be wrong.  The _only_ argument against
them was literally "what if this changes something subtle", together with
patches like the above that do _not_ make sense either on a big picture
level or even locally on a small level.

The reason I didn't apply those patches for 5.3 was that they came in very
late, and there were horrendous numbers for the 5.2 behavior that caused
those two big reverts.  But the patches made sense back then, the timing
for them just didn't.

I was hoping this would just get done in the mm tree, but clearly it
isn't, and I'll just do my own mm branch and merge the patches that way.

Linus
On Sat 28-09-19 13:59:26, Linus Torvalds wrote:
> On Fri, Sep 27, 2019 at 12:48 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > -	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
> > +	if (!order)
> > +		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
> >  	if (page)
> >  		goto got_pg;
> >
> > The whole point of handling this in the page allocator directly is to
> > have a unified solution rather than have each specific caller invent
> > its own way to achieve higher locality.
>
> The above just looks hacky.

It is, and it was meant to help move on when debugging rather than as a
final solution.

> Why would order-0 be special?

Ideally it wouldn't be, but the current implementation makes it special.
Why?  Because the whole concept of the low wmark fast path attempt is
based on kswapd balancing for a high watermark providing some space.
Kcompactd doesn't have any notion like that.  And I believe that a large
part of the problem really is there.  If I am wrong here then I would
appreciate being corrected.

If __GFP_THISNODE allows for a better THP utilization on a local node then
the problem points at kcompactd not being pro-active enough.  And that is
what the first diff was aiming at.

I also claim that this is not a THP specific problem.  You are right that
lower orders are less likely to hit the problem because the memory is
usually not fragmented that heavily, but fundamentally the over-eager
fallback in the fast path is still there.  And that is the reason for me
to push back against __GFP_THISNODE && fallback allocation opencoded
outside of the allocator.  The allocator knows the context can compact, so
why should we require the caller to be doing that?

Do not get me wrong, but we have quite a long history of fine tuning for
THP by adding kludges here and there and they usually turn out to break
something else.  I really want to get to understand the underlying problem
and base a solution on it rather than "__GFP_THISNODE can cause
overreclaim so pick up a break out condition and hope for the best".
On Mon 30-09-19 13:28:17, Michal Hocko wrote:
[...]
> Do not get me wrong, but we have quite a long history of fine tuning for
> THP by adding kludges here and there and they usually turn out to break
> something else. I really want to get to understand the underlying problem
> and base a solution on it rather than "__GFP_THISNODE can cause
> overreclaim so pick up a break out condition and hope for the best".

I didn't really get to any testing earlier, but I was really suspecting
that hugetlb will be the first one affected because it uses
__GFP_RETRY_MAYFAIL - aka it really wants to succeed as much as possible
because this is a direct admin request to preallocate a specific number of
huge pages.  Patch 3 doesn't really make any difference for those
requests.

Here is a very simple test I've done.  Single NUMA node with 1G of memory.
Populate the memory with clean page cache (that is both easy to compact
and to reclaim) and then try to allocate 512M worth of hugetlb pages.

root@test1:~# cat hugetlb_test.sh
#!/bin/sh
set -x
echo 0 > /proc/sys/vm/nr_hugepages
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
TS=$(date +%s)
cp /proc/vmstat vmstat.$TS.before
echo 256 > /proc/sys/vm/nr_hugepages
cat /proc/sys/vm/nr_hugepages
cp /proc/vmstat vmstat.$TS.after

The results for 2 consecutive runs on clean 5.3:

root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s
+ date +%s
+ TS=1569905284
+ cp /proc/vmstat vmstat.1569905284.before
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
256
+ cp /proc/vmstat vmstat.1569905284.after
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s
+ date +%s
+ TS=1569905311
+ cp /proc/vmstat vmstat.1569905311.before
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
256
+ cp /proc/vmstat vmstat.1569905311.after

so we get all the requested huge pages to the pool.

Now with the first 3 patches of this series applied (the last one doesn't
make any difference for hugetlb allocations):

root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s
+ date +%s
+ TS=1569905516
+ cp /proc/vmstat vmstat.1569905516.before
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
11
+ cp /proc/vmstat vmstat.1569905516.after
root@test1:~# sh hugetlb_test.sh
+ echo 0
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s
+ date +%s
+ TS=1569905541
+ cp /proc/vmstat vmstat.1569905541.before
+ echo 256
+ cat /proc/sys/vm/nr_hugepages
12
+ cp /proc/vmstat vmstat.1569905541.after

so we do not get more than 12 huge pages, which is really poor.  Although
hugetlb pages tend to be allocated early after the boot they are still an
explicit admin request and having less than 5% success rate is really bad.
If anything the __GFP_RETRY_MAYFAIL needs to be reflected in the code.

I can provide vmstat files if anybody is interested.

Then I have tried another test for THP.  It is essentially the same thing.
Populate the page cache to simulate a quite common case of memory being
used for the cache and then populate a 512M anonymous area with
MADV_HUGEPAGE set on it:

$ cat thp_test.sh
#!/bin/sh
set -x
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
TS=$(date +%s)
cp /proc/vmstat vmstat.$TS.before
./mem_eater nowait 500M
cp /proc/vmstat vmstat.$TS.after

mem_eater is essentially an (mmap, madvise, and touch page by page) dummy
app.

Again a clean 5.3 kernel:

root@test1:~# sh thp_test.sh
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 20.8575 s, 51.5 MB/s
+ date +%s
+ TS=1569906274
+ cp /proc/vmstat vmstat.1569906274.before
+ ./mem_eater nowait 500M
7f55e8282000-7f5607682000 rw-p 00000000 00:00 0
Size: 512000 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 512000 kB
Pss: 512000 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 512000 kB
Referenced: 260616 kB
Anonymous: 512000 kB
LazyFree: 0 kB
AnonHugePages: 509952 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
THPeligible: 1
+ cp /proc/vmstat vmstat.1569906274.after
root@test1:~# sh thp_test.sh
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.8648 s, 49.1 MB/s
+ date +%s
+ TS=1569906333
+ cp /proc/vmstat vmstat.1569906333.before
+ ./mem_eater nowait 500M
7f26625cd000-7f26819cd000 rw-p 00000000 00:00 0
Size: 512000 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 512000 kB
Pss: 512000 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 512000 kB
Referenced: 259892 kB
Anonymous: 512000 kB
LazyFree: 0 kB
AnonHugePages: 509952 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
THPeligible: 1
+ cp /proc/vmstat vmstat.1569906333.after

We are getting quite consistent 99% THP utilization.

grep "pgsteal_direct\|pgsteal_kswapd\|allocstall_movable\|compact_stall" vmstat.1569906333.{before,after}
vmstat.1569906333.before:allocstall_movable 29
vmstat.1569906333.before:pgsteal_kswapd 206760
vmstat.1569906333.before:pgsteal_direct 30162
vmstat.1569906333.before:compact_stall 29
vmstat.1569906333.after:allocstall_movable 65
vmstat.1569906333.after:pgsteal_kswapd 298688
vmstat.1569906333.after:pgsteal_direct 67645
vmstat.1569906333.after:compact_stall 66

Hit the direct compaction 37 times and reclaimed 146M in direct, 359M by
kswapd, which is 505M in total and not bad for a 512M request.

5.3 + 3 patches (this is a non-NUMA machine so the only difference the 4th
patch would make is timing, because it just retries 2 times on the same
node):
root@test1:~# sh thp_test.sh
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.0732 s, 51.0 MB/s
+ date +%s
+ TS=1569906542
+ cp /proc/vmstat vmstat.1569906542.before
+ ./mem_eater nowait 500M
7f2799e08000-7f27b9208000 rw-p 00000000 00:00 0
Size: 512000 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 512000 kB
Pss: 512000 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 512000 kB
Referenced: 294944 kB
Anonymous: 512000 kB
LazyFree: 0 kB
AnonHugePages: 477184 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
THPeligible: 1
+ cp /proc/vmstat vmstat.1569906542.after
root@test1:~# sh thp_test.sh
+ echo 3
+ echo 1
+ dd if=/mnt/data/file-1G of=/dev/null bs=4096
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB) copied, 21.9239 s, 49.0 MB/s
+ date +%s
+ TS=1569906569
+ cp /proc/vmstat vmstat.1569906569.before
+ ./mem_eater nowait 500M
7fa29a234000-7fa2b9634000 rw-p 00000000 00:00 0
Size: 512000 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Rss: 512000 kB
Pss: 512000 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 512000 kB
Referenced: 253480 kB
Anonymous: 512000 kB
LazyFree: 0 kB
AnonHugePages: 460800 kB
ShmemPmdMapped: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
Locked: 0 kB
THPeligible: 1
+ cp /proc/vmstat vmstat.1569906569.after

The drop in the utilization is not that rapid here, but it shows that the
results are not very stable either.

grep "pgsteal_direct\|pgsteal_kswapd\|allocstall_movable\|compact_stall" vmstat.1569906569.{before,after}
vmstat.1569906569.before:allocstall_movable 52
vmstat.1569906569.before:pgsteal_kswapd 182617
vmstat.1569906569.before:pgsteal_direct 54281
vmstat.1569906569.before:compact_stall 85
vmstat.1569906569.after:allocstall_movable 64
vmstat.1569906569.after:pgsteal_kswapd 296840
vmstat.1569906569.after:pgsteal_direct 66778
vmstat.1569906569.after:compact_stall 191

We have hit the direct compaction 106 times and reclaimed 48M from direct
and 446M from kswapd, totaling 494M reclaimed.  Slightly less than with
clean 5.3 but I would consider it within noise.

I didn't really get to analyze the numbers very deeply, but from a very
preliminary look it seems that the bailout based on the watermark check is
causing volatility because it depends on the kswapd activity in the
background.  Please note that this is pretty much the ideal case when the
reclaimable memory is essentially free to drop.  If kswapd starts fighting
to get memory reclaimed then the THP utilization is likely to drop down as
well.  On the other hand it is fair to say that it is really hard to tell
what compaction would do under those conditions.

I also didn't really get to test any NUMA aspect of the change yet.  I
still do hope that David can share something I can play with because I do
not want to create something completely artificial.
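mem_eater is only described in this thread as an "mmap, madvise, and touch
page by page" dummy app; the actual tool is not posted.  A minimal
stand-in matching that description (argument handling is simplified and
assumed; the real tool also prints its smaps/numa_maps entry and handles
the "nowait" flag, elided here) might look like:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

/* crude "500M"-style size parser; assumption: only an M suffix is handled */
static size_t parse_size(const char *s)
{
	char *end;
	size_t val = strtoul(s, &end, 10);
	if (*end == 'M')
		val <<= 20;
	return val;
}

int main(int argc, char **argv)
{
	/* invoked above as "./mem_eater nowait 500M", so size is argv[2] */
	size_t len = argc > 2 ? parse_size(argv[2]) : (500UL << 20);

	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* ask for THP backing, as in the tests above */
	madvise(p, len, MADV_HUGEPAGE);

	/* touch page by page so every base page (or huge page) is faulted in */
	for (size_t off = 0; off < len; off += 4096)
		p[off] = 1;

	/* dumping /proc/self/smaps and numa_maps is left out of this sketch */
	return 0;
}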
On Tue 01-10-19 07:43:43, Michal Hocko wrote: [...] > I also didn't really get to test any NUMA aspect of the change yet. I > still do hope that David can share something I can play with > because I do not want to create something completely artificial. I have split out my kvm machine into two nodes to get at least some idea how these patches behave $ numactl -H available: 2 nodes (0-1) node 0 cpus: 0 2 node 0 size: 475 MB node 0 free: 432 MB node 1 cpus: 1 3 node 1 size: 503 MB node 1 free: 458 MB Again, I am using the simple page cache full of memory and mmap less than the full capacity of node 1 with a preferred numa policy: root@test1:~# cat thp_test.sh #!/bin/sh set -x echo 3 > /proc/sys/vm/drop_caches echo 1 > /proc/sys/vm/compact_memory dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10)) TS=$(date +%s) cp /proc/vmstat vmstat.$TS.before numactl --preferred 1 ./mem_eater nowait 400M cp /proc/vmstat vmstat.$TS.after First run with 5.3 and without THP $ echo never > /sys/kernel/mm/transparent_hugepage/enabled root@test1:~# sh thp_test.sh + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 21.938 s, 48.9 MB/s + date +%s + TS=1569914851 + cp /proc/vmstat vmstat.1569914851.before + numactl --preferred 1 ./mem_eater nowait 400M 7f4bdefec000-7f4bf7fec000 rw-p 00000000 00:00 0 Size: 409600 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Rss: 409600 kB Pss: 409600 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 409600 kB Referenced: 344460 kB Anonymous: 409600 kB LazyFree: 0 kB AnonHugePages: 0 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB Locked: 0 kB THPeligible: 0 7f4bdefec000 prefer:1 anon=102400 dirty=102400 active=86115 N0=41963 N1=60437 kernelpagesize_kB=4 + cp /proc/vmstat vmstat.1569914851.after Few more runs: 7fd0f248b000 prefer:1 anon=102400 dirty=102400 active=86909 N0=40079 N1=62321 kernelpagesize_kB=4 7f2a69fc3000 prefer:1 anon=102400 dirty=102400 active=85244 N0=44455 N1=57945 kernelpagesize_kB=4 So we get around 56-60% pages to the preferred node Now let's enable THPs root@test1:~# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled root@test1:~# sh thp_test.sh + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 21.8665 s, 49.1 MB/s + date +%s + TS=1569915082 + cp /proc/vmstat vmstat.1569915082.before + numactl --preferred 1 ./mem_eater nowait 400M 7f05c6dee000-7f05dfdee000 rw-p 00000000 00:00 0 Size: 409600 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Rss: 409600 kB Pss: 409600 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 409600 kB Referenced: 210872 kB Anonymous: 409600 kB LazyFree: 0 kB AnonHugePages: 407552 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB Locked: 0 kB THPeligible: 1 7f05c6dee000 prefer:1 anon=102400 dirty=102400 active=52718 N0=50688 N1=51712 kernelpagesize_kB=4 + cp /proc/vmstat vmstat.1569915082.after Few more runs AnonHugePages: 407552 kB 7effca1b9000 prefer:1 anon=102400 dirty=102400 active=65977 N0=53760 N1=48640 kernelpagesize_kB=4 AnonHugePages: 407552 kB 7f474bfc4000 prefer:1 anon=102400 dirty=102400 active=52676 N0=8704 N1=93696 kernelpagesize_kB=4 The utilization is again almost 100% and the preferred node usage varied a lot between 47-91%. The last one looks like the result we would like to achieve, right? 
Now with 5.3 + all 4 patches this time: root@test1:~# sh thp_test.sh + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 21.5368 s, 49.9 MB/s + date +%s + TS=1569915403 + cp /proc/vmstat vmstat.1569915403.before + numactl --preferred 1 ./mem_eater nowait 400M 7f8114ab4000-7f812dab4000 rw-p 00000000 00:00 0 Size: 409600 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Rss: 409600 kB Pss: 409600 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 409600 kB Referenced: 207568 kB Anonymous: 409600 kB LazyFree: 0 kB AnonHugePages: 401408 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB Locked: 0 kB THPeligible: 1 7f8114ab4000 prefer:1 anon=102400 dirty=102400 active=51892 N0=3072 N1=99328 kernelpagesize_kB=4 + cp /proc/vmstat vmstat.1569915403.after Few more runs: AnonHugePages: 376832 kB 7f37a1404000 prefer:1 anon=102400 dirty=102400 active=55204 N0=23153 N1=79247 kernelpagesize_kB=4 AnonHugePages: 372736 kB 7f4abe4af000 prefer:1 anon=102400 dirty=102400 active=52399 N0=23646 N1=78754 kernelpagesize_kB=4 The THP utilization varies again and the locality is higher in average 76+%. Which is even higher than the base page case. I was really wondering what is the reason for that So I've added a simple debugging diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 8caab1f81a52..565f667f6dcb 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2140,9 +2140,11 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, * to prefer hugepage backing, retry allowing remote * memory as well. */ - if (!page && (gfp & __GFP_DIRECT_RECLAIM)) + if (!page && (gfp & __GFP_DIRECT_RECLAIM)) { page = __alloc_pages_node(hpage_node, gfp | __GFP_NORETRY, order); + printk("XXX: %s\n", !page ? "fail" : hpage_node == page_to_nid(page)?"match":"fallback"); + } goto out; } Cases like AnonHugePages: 407552 kB 7f3ab2ebf000 prefer:1 anon=102400 dirty=102400 active=77503 N0=43520 N1=58880 kernelpagesize_kB=4 85 XXX: fallback a very successful ones on the other hand AnonHugePages: 280576 kB 7feffd9c2000 prefer:1 anon=102400 dirty=102400 active=52563 N0=3131 N1=99269 kernelpagesize_kB=4 62 XXX: fail 3 XXX: fallback Note that 62 failing THPs is 31744 pages which also explains much higher locality I suspect (we simply allocate small pages from the preferred node). This is just a very trivial test and I still have to grasp the outcome. My current feeling is that the whole thing is very timing specific and the higher local utilization depends on somebody else doing a reclaim on our behalf with patches applied. 5.3 without any patches behaves more stable although there is a slightly higher off node THP success rate but it also seems that the same overall THP success rate can be achieved from the local node as well so performing compaction would make more sense for those. 
With the simple hack on top of 5.3 diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 9c9194959271..414d0eaa6287 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4685,7 +4685,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, { struct page *page; unsigned int alloc_flags = ALLOC_WMARK_LOW; - gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */ + gfp_t alloc_mask, first_alloc_mask; /* The gfp_t that was actually used for allocation */ struct alloc_context ac = { }; /* @@ -4711,7 +4711,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp_mask); /* First allocation attempt */ - page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac); + first_alloc_mask = alloc_mask; + if (order && (alloc_mask & __GFP_DIRECT_RECLAIM)) + first_alloc_mask |= __GFP_THISNODE; + page = get_page_from_freelist(first_alloc_mask, order, alloc_flags, &ac); if (likely(page)) goto out; I am getting AnonHugePages: 407552 kB 7f88a67ca000 prefer:1 anon=102400 dirty=102400 active=60362 N0=39424 N1=62976 kernelpagesize_kB=4 AnonHugePages: 407552 kB 7f18f0de5000 prefer:1 anon=102400 dirty=102400 active=51685 N0=34720 N1=67680 kernelpagesize_kB=4 AnonHugePages: 407552 kB 7f443f89e000 prefer:1 anon=102400 dirty=102400 active=52382 N0=62976 N1=39424 kernelpagesize_kB=4 While first two rounds looked very promising with 60%+ locality with a consistent THP utilization then the locality went crazy so there is more to look into. The fast path still seems to make some difference though.
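As an aside, for anyone unfamiliar with the setup: "numactl --preferred 1" simply installs an MPOL_PREFERRED task policy before exec'ing the workload, so allocations try node 1 first and silently fall back to node 0. A hedged sketch of the equivalent call (illustrative only, not numactl's actual code):

#include <stdio.h>
#include <numaif.h>	/* set_mempolicy(), MPOL_PREFERRED; link with -lnuma */

int main(void)
{
	unsigned long nodemask = 1UL << 1;	/* node 1 */

	/* prefer node 1 for all future allocations of this task */
	if (set_mempolicy(MPOL_PREFERRED, &nodemask, 8 * sizeof(nodemask)) != 0) {
		perror("set_mempolicy");
		return 1;
	}
	/* exec the real workload here; page faults (including THP faults)
	 * will try node 1 first and fall back to node 0 when it is low */
	return 0;
}

The locality percentages quoted above are then simply N1/(N0+N1) from the numa_maps line, e.g. N1=62321 out of 102400 pages is roughly 61% on the preferred node.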
On 10/1/19 7:43 AM, Michal Hocko wrote: > so we do not get more that 12 huge pages which is really poor. Although > hugetlb pages tend to be allocated early after the boot they are still > an explicit admin request and having less than 5% success rate is really > bad. If anything the __GFP_RETRY_MAYFAIL needs to be reflected in the > code. Yeah it's roughly what I expected, thanks for the testing. How about this patch on top? ---8<--- From 3ae67ab2274626c276ff2dd58794215a8461f045 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka <vbabka@suse.cz> Date: Tue, 1 Oct 2019 14:20:58 +0200 Subject: [RFC] mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations THP page faults now attempt a __GFP_THISNODE allocation first, which should only compact existing free memory, followed by another attempt that can allocate from any node using reclaim/compaction effort specified by global defrag setting and madvise. This patch makes the following changes to the scheme: - before the patch, the first allocation relies on a check for pageblock order and __GFP_IO. This however affects also the second attempt, and also hugetlb allocations and other allocations of whole pageblock. Instead of that, reuse the existing check for costly order __GFP_NORETRY allocations, and make sure the first THP attempt uses __GFP_NORETRY. As a side-effect, all costly order __GFP_NORETRY allocations will bail out if compaction needs reclaim, while previously they only bailed out when compaction was deferred due to previous failures. This should be still acceptable within the __GFP_NORETRY semantics. - before the patch, the second allocation attempt (on all nodes) was passing __GFP_NORETRY. This is redundant as the check for pageblock order (discussed above) was stronger. It's also contrary to madvise(MADV_HUGEPAGE) which means some effort to allocate THP is requested. After this patch, the second attempt doesn't pass __GFP_THISNODE nor __GFP_NORETRY. To sum up, THP page faults now try the following attempt: 1. local node only THP allocation with no reclaim, just compaction. 2. THP allocation from any node with effort determined by global defrag setting and VMA madvise 3. fallback to base pages on any node Signed-off-by: Vlastimil Babka <vbabka@suse.cz> --- mm/mempolicy.c | 16 +++++++++------- mm/page_alloc.c | 23 +++++------------------ 2 files changed, 14 insertions(+), 25 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 4ae967bcf954..2c48146f3ee2 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2129,18 +2129,20 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, nmask = policy_nodemask(gfp, pol); if (!nmask || node_isset(hpage_node, *nmask)) { mpol_cond_put(pol); + /* + * First, try to allocate THP only on local node, but + * don't reclaim unnecessarily, just compact. + */ page = __alloc_pages_node(hpage_node, - gfp | __GFP_THISNODE, order); + gfp | __GFP_THISNODE | __GFP_NORETRY, order); /* - * If hugepage allocations are configured to always - * synchronous compact or the vma has been madvised - * to prefer hugepage backing, retry allowing remote - * memory as well. + * If that fails, allow both compaction and reclaim, + * but on all nodes. 
*/ - if (!page && (gfp & __GFP_DIRECT_RECLAIM)) + if (!page) page = __alloc_pages_node(hpage_node, - gfp | __GFP_NORETRY, order); + gfp, order); goto out; } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 15c2050c629b..da9075d4cdf6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4467,7 +4467,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, if (page) goto got_pg; - if (order >= pageblock_order && (gfp_mask & __GFP_IO)) { + /* + * Checks for costly allocations with __GFP_NORETRY, which + * includes some THP page fault allocations + */ + if (costly_order && (gfp_mask & __GFP_NORETRY)) { /* * If allocating entire pageblock(s) and compaction * failed because all zones are below low watermarks @@ -4487,23 +4491,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, if (compact_result == COMPACT_SKIPPED || compact_result == COMPACT_DEFERRED) goto nopage; - } - - /* - * Checks for costly allocations with __GFP_NORETRY, which - * includes THP page fault allocations - */ - if (costly_order && (gfp_mask & __GFP_NORETRY)) { - /* - * If compaction is deferred for high-order allocations, - * it is because sync compaction recently failed. If - * this is the case and the caller requested a THP - * allocation, we do not want to heavily disrupt the - * system, so we fail the allocation instead of entering - * direct reclaim. - */ - if (compact_result == COMPACT_DEFERRED) - goto nopage; /* * Looks like reclaim/compaction is worth trying, but
On Tue, 1 Oct 2019, Vlastimil Babka wrote: > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 4ae967bcf954..2c48146f3ee2 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -2129,18 +2129,20 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, > nmask = policy_nodemask(gfp, pol); > if (!nmask || node_isset(hpage_node, *nmask)) { > mpol_cond_put(pol); > + /* > + * First, try to allocate THP only on local node, but > + * don't reclaim unnecessarily, just compact. > + */ > page = __alloc_pages_node(hpage_node, > - gfp | __GFP_THISNODE, order); > + gfp | __GFP_THISNODE | __GFP_NORETRY, order); The page allocator has heuristics to determine when compaction should be retried, reclaim should be retried, and the allocation itself should retry for high-order allocations (see PAGE_ALLOC_COSTLY_ORDER). PAGE_ALLOC_COSTLY_ORDER exists solely to avoid poor allocator behavior when reclaim itself is unlikely -- or disruptive enough -- in making that amount of contiguous memory available. Rather than papering over the poor feedback loop between compaction and reclaim that exists in the page allocator, it's probably better to improve that and determine when an allocation should fail or it's worthwhile to retry. That's a separate topic from NUMA locality of thp. In other words, we should likely address how compaction and reclaim is done for all high order-allocations in the page allocator itself rather than only here for hugepage allocations and relying on specific gfp bits to be set. Ask: if the allocation here should not retry regardless of why compaction failed, why should any high-order allocation (user or kernel) retry if compaction failed and at what order we should just fail? If hugetlb wants to stress this to the fullest extent possible, it already appropriately uses __GFP_RETRY_MAYFAIL. The code here is saying we care more about NUMA locality than hugepages simply because that's where the access latency is better and is specific to hugepages; allocation behavior for high-order pages needs to live in the page allocator. > > /* > - * If hugepage allocations are configured to always > - * synchronous compact or the vma has been madvised > - * to prefer hugepage backing, retry allowing remote > - * memory as well. > + * If that fails, allow both compaction and reclaim, > + * but on all nodes. > */ > - if (!page && (gfp & __GFP_DIRECT_RECLAIM)) > + if (!page) > page = __alloc_pages_node(hpage_node, > - gfp | __GFP_NORETRY, order); > + gfp, order); > > goto out; > } We certainly don't want this for non-MADV_HUGEPAGE regions when thp enabled bit is not set to "always". We'd much rather fallback to local native pages because of its improved access latency. This is allowing all hugepage allocations to be remote even without MADV_HUGEPAGE which is not even what Andrea needs: qemu uses MADV_HUGEPAGE.
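To make the contrast drawn here a bit more concrete, the two callers being compared request roughly the following effort levels from the page allocator (a hedged, from-memory approximation for the 5.3/5.4 time frame; the exact masks live in include/linux/gfp.h, mm/huge_memory.c and mm/hugetlb.c):

/* THP page fault, local-node attempt: latency sensitive and has a base-page
 * fallback, so it should mostly compact and bail out early (approximate): */
gfp_t thp_local_gfp = GFP_TRANSHUGE | __GFP_THISNODE | __GFP_NORETRY;

/* hugetlb pool allocation: an explicit admin request with no fallback, so it
 * opts into heavy reclaim/compaction via __GFP_RETRY_MAYFAIL (approximate): */
gfp_t hugetlb_pool_gfp = GFP_HIGHUSER_MOVABLE | __GFP_COMP |
			 __GFP_RETRY_MAYFAIL | __GFP_NOWARN;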
On 10/1/19 10:31 PM, David Rientjes wrote: > On Tue, 1 Oct 2019, Vlastimil Babka wrote: > >> diff --git a/mm/mempolicy.c b/mm/mempolicy.c >> index 4ae967bcf954..2c48146f3ee2 100644 >> --- a/mm/mempolicy.c >> +++ b/mm/mempolicy.c >> @@ -2129,18 +2129,20 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, >> nmask = policy_nodemask(gfp, pol); >> if (!nmask || node_isset(hpage_node, *nmask)) { >> mpol_cond_put(pol); >> + /* >> + * First, try to allocate THP only on local node, but >> + * don't reclaim unnecessarily, just compact. >> + */ >> page = __alloc_pages_node(hpage_node, >> - gfp | __GFP_THISNODE, order); >> + gfp | __GFP_THISNODE | __GFP_NORETRY, order); > > The page allocator has heuristics to determine when compaction should be > retried, reclaim should be retried, and the allocation itself should retry > for high-order allocations (see PAGE_ALLOC_COSTLY_ORDER). Yes, and flags like __GFP_NORETRY and __GFP_RETRY_MAYFAIL affect these heuristics according to the caller's willingness to wait vs availability of some kind of fallback. > PAGE_ALLOC_COSTLY_ORDER exists solely to avoid poor allocator behavior > when reclaim itself is unlikely -- or disruptive enough -- in making that > amount of contiguous memory available. The __GFP_NORETRY check for costly orders is one of the implementation details of that poor allocator behavior avoidance. > Rather than papering over the poor feedback loop between compaction and > reclaim that exists in the page allocator, it's probably better to improve > that and determine when an allocation should fail or it's worthwhile to > retry. That's a separate topic from NUMA locality of thp. I don't disagree. But we are doing that 'papering over' already and I simply believe it can be done better, until we can drop it. Right now at least the hugetlb allocations are regressing, as Michal showed. > In other words, we should likely address how compaction and reclaim is > done for all high order-allocations in the page allocator itself rather > than only here for hugepage allocations and relying on specific gfp bits > to be set. Well, with your patches, decisions are made based on pageblock order (affecting also hugetlbfs) and a specific __GFP_IO bit. I don't see how using a __GFP_NORETRY and costly order is worse - both are concepts already used to guide reclaim/compaction decisions. And it won't affect hugetlbfs. Also with some THP defrag modes, __GFP_NORETRY is also already being used. > Ask: if the allocation here should not retry regardless of why > compaction failed, why should any high-order allocation (user or kernel) > retry if compaction failed and at what order we should just fail? If that's refering to the quoted code above, the allocation here should not retry because it would lead to reclaiming just the local node, effectively a node reclaim mode, leading to the problems Andrea and Mel reported. That's because __GFP_THISNODE makes it rather specific, so we shouldn't conclude anything for the other kinds of allocations, as you're asking. > If > hugetlb wants to stress this to the fullest extent possible, it already > appropriately uses __GFP_RETRY_MAYFAIL. Which doesn't work anymore right now, and should again after this patch. > The code here is saying we care more about NUMA locality than hugepages That's why there's __GFP_THISNODE, yes. > simply because that's where the access latency is better and is specific > to hugepages; allocation behavior for high-order pages needs to live in > the page allocator. 
I'd rather add the __GFP_NORETRY to __GFP_THISNODE here to influence the allocation behavior, than reintroduce any __GFP_THISNODE checks in the page allocator. There are not that many __GFP_THISNODE users, but there are plenty of __GFP_NORETRY users, so it seems better to adjust behavior based on the latter flag. IIRC in the recent past you argued for putting back __GFP_NORETRY to all THP allocations including madvise, so why is there a problem now? >> >> /* >> - * If hugepage allocations are configured to always >> - * synchronous compact or the vma has been madvised >> - * to prefer hugepage backing, retry allowing remote >> - * memory as well. >> + * If that fails, allow both compaction and reclaim, >> + * but on all nodes. >> */ >> - if (!page && (gfp & __GFP_DIRECT_RECLAIM)) >> + if (!page) >> page = __alloc_pages_node(hpage_node, >> - gfp | __GFP_NORETRY, order); Also this __GFP_NORETRY was added by your patch, so I really don't see why do you think it's wrong in the first allocation attempt. >> + gfp, order); >> >> goto out; >> } > > We certainly don't want this for non-MADV_HUGEPAGE regions when thp > enabled bit is not set to "always". We'd much rather fallback to local > native pages because of its improved access latency. This is allowing all > hugepage allocations to be remote even without MADV_HUGEPAGE which is not > even what Andrea needs: qemu uses MADV_HUGEPAGE. OTOH it's better to use a remote THP than a remote base page, in case the local node is below watermark. But that would probably require a larger surgery of the allocator. But what you're saying is keep the gfp & __GFP_DIRECT_RECLAIM condition. I still suggest to drop the __GFP_NORETRY as that goes against MADV_HUGEPAGE.
On Tue 01-10-19 23:54:14, Vlastimil Babka wrote:
> On 10/1/19 10:31 PM, David Rientjes wrote:
[...]
> > If hugetlb wants to stress this to the fullest extent possible, it
> > already appropriately uses __GFP_RETRY_MAYFAIL.
>
> Which doesn't work anymore right now, and should again after this patch.

I didn't get to fully digest the patch Vlastimil is proposing. (Ab)using
__GFP_NORETRY is quite subtle, but it is already in place with some
explanation and a reference to THPs. So while I am not really happy, it is
at least something you can reason about.

b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may
not succeed"), on the other hand, has added a much wider change which has
clearly broken hugetlb and any __GFP_RETRY_MAYFAIL user of pageblock_order
sized allocations. That is much worse, and it is something I was pointing at
during the review; those concerns were never really addressed before
merging.

In any case this is something to be fixed ASAP. Do you have any better
proposal? I do not assume you would be proposing yet another revert.
On Wed, 2 Oct 2019, Michal Hocko wrote: > > > If > > > hugetlb wants to stress this to the fullest extent possible, it already > > > appropriately uses __GFP_RETRY_MAYFAIL. > > > > Which doesn't work anymore right now, and should again after this patch. > > I didn't get to fully digest the patch Vlastimil is proposing. (Ab)using > __GFP_NORETRY is quite subtle but it is already in place with some > explanation and a reference to THPs. So while I am not really happy it > is at least something you can reason about. > It's a no-op: /* Do not loop if specifically requested */ if (gfp_mask & __GFP_NORETRY) goto nopage; /* * Do not retry costly high order allocations unless they are * __GFP_RETRY_MAYFAIL */ if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL)) goto nopage; So I'm not sure we should spend too much time discussing a hunk of a patch that doesn't do anything. > b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction > may not succeed") on the other hand has added a much more wider change > which has clearly broken hugetlb and any __GFP_RETRY_MAYFAIL user of > pageblock_order sized allocations. And that is much worse and something > I was pointing at during the review and those concerns were never really > addressed before merging. > > In any case this is something to be fixed ASAP. Do you have any better > proposa? I do not assume you would be proposing yet another revert. I thought Mike Kravetz said[*] that hugetlb was not negatively affected by this? We could certainly disregard this logic for __GFP_RETRY_MAYFAIL if anybody is relying on excessive reclaim ("swap storms") that does not allow compaction to make forward progress for some reason. [*] https://marc.info/?l=linux-kernel&m=156771690024533 If not, for the purposes of this conversation we can disregard __GFP_NORETRY per the above because thp does not use __GFP_RETRY_MAYFAIL.
On 10/3/19 12:32 AM, David Rientjes wrote: > On Wed, 2 Oct 2019, Michal Hocko wrote: > >>>> If >>>> hugetlb wants to stress this to the fullest extent possible, it already >>>> appropriately uses __GFP_RETRY_MAYFAIL. >>> >>> Which doesn't work anymore right now, and should again after this patch. >> >> I didn't get to fully digest the patch Vlastimil is proposing. (Ab)using >> __GFP_NORETRY is quite subtle but it is already in place with some >> explanation and a reference to THPs. So while I am not really happy it >> is at least something you can reason about. >> > > It's a no-op: > > /* Do not loop if specifically requested */ > if (gfp_mask & __GFP_NORETRY) > goto nopage; > > /* > * Do not retry costly high order allocations unless they are > * __GFP_RETRY_MAYFAIL > */ > if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL)) > goto nopage; > > So I'm not sure we should spend too much time discussing a hunk of a patch > that doesn't do anything. I believe Michal was talking about my (ab)use of __GFP_NORETRY, where it controls the earlier 'goto nopage' condition. Yes, with your patches alone, the addition of __GFP_NORETRY in the second attempt is a no-op, although then I don't see the point of confusing people reading the code with it.
On Thu 03-10-19 10:00:08, Vlastimil Babka wrote: > On 10/3/19 12:32 AM, David Rientjes wrote: > > On Wed, 2 Oct 2019, Michal Hocko wrote: > > > >>>> If > >>>> hugetlb wants to stress this to the fullest extent possible, it already > >>>> appropriately uses __GFP_RETRY_MAYFAIL. > >>> > >>> Which doesn't work anymore right now, and should again after this patch. > >> > >> I didn't get to fully digest the patch Vlastimil is proposing. (Ab)using > >> __GFP_NORETRY is quite subtle but it is already in place with some > >> explanation and a reference to THPs. So while I am not really happy it > >> is at least something you can reason about. > >> > > > > It's a no-op: > > > > /* Do not loop if specifically requested */ > > if (gfp_mask & __GFP_NORETRY) > > goto nopage; > > > > /* > > * Do not retry costly high order allocations unless they are > > * __GFP_RETRY_MAYFAIL > > */ > > if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL)) > > goto nopage; > > > > So I'm not sure we should spend too much time discussing a hunk of a patch > > that doesn't do anything. > > I believe Michal was talking about my (ab)use of __GFP_NORETRY, where it > controls the earlier 'goto nopage' condition. That is correct. From a maintainability point of view it would be better to have only a single bailout of an optimistic compaction attempt. If we go with [1] then we have two different criterion to bail out and that is really messy and error prone. While sticking __GFP_RETRY_MAYFAIL as suggest in [1] fixes up the immediate regression in the simplest way this all really begs for a proper analysis and a _real_ fix. Can we move that direction finally, please? I would really love to conduct further testing but I haven't really heard anything to results presented so far. I have no idea whether that is even remotely resembling anything David needs for his claimed regression. [1] http://lkml.kernel.org/r/alpine.DEB.2.21.1910021556270.187014@chino.kir.corp.google.com
It's been some time since I've posted these results. The hugetlb issue got resolved but I would still like to hear back about these findings because they suggest that the current bail out strategy doesn't seem to produce very good results. Essentially it doesn't really help THP locality (on moderately filled up nodes) and it introduces a strong dependency on kswapd which is not a source of the high order pages. Also the overral THP success rate is decreased on a pretty standard "RAM is used for page cache" workload. That makes me think that the only possible workload that might really benefit from this heuristic is a THP demanding one on a heavily fragmented node with a lot of free memory while other nodes are not fragmented and have quite a lot of free memory. If that is the case, is this something to optimize for? I am keeping all the results for the reference in a condensed form On Tue 01-10-19 10:37:43, Michal Hocko wrote: > I have split out my kvm machine into two nodes to get at least some > idea how these patches behave > $ numactl -H > available: 2 nodes (0-1) > node 0 cpus: 0 2 > node 0 size: 475 MB > node 0 free: 432 MB > node 1 cpus: 1 3 > node 1 size: 503 MB > node 1 free: 458 MB > > First run with 5.3 and without THP > $ echo never > /sys/kernel/mm/transparent_hugepage/enabled > root@test1:~# sh thp_test.sh > 7f4bdefec000 prefer:1 anon=102400 dirty=102400 active=86115 N0=41963 N1=60437 kernelpagesize_kB=4 > 7fd0f248b000 prefer:1 anon=102400 dirty=102400 active=86909 N0=40079 N1=62321 kernelpagesize_kB=4 > 7f2a69fc3000 prefer:1 anon=102400 dirty=102400 active=85244 N0=44455 N1=57945 kernelpagesize_kB=4 > > So we get around 56-60% pages to the preferred node > > Now let's enable THPs > AnonHugePages: 407552 kB > 7f05c6dee000 prefer:1 anon=102400 dirty=102400 active=52718 N0=50688 N1=51712 kernelpagesize_kB=4 > Few more runs > AnonHugePages: 407552 kB > 7effca1b9000 prefer:1 anon=102400 dirty=102400 active=65977 N0=53760 N1=48640 kernelpagesize_kB=4 > AnonHugePages: 407552 kB > 7f474bfc4000 prefer:1 anon=102400 dirty=102400 active=52676 N0=8704 N1=93696 kernelpagesize_kB=4 > > The utilization is again almost 100% and the preferred node usage > varied a lot between 47-91%. > > Now with 5.3 + all 4 patches this time: > AnonHugePages: 401408 kB > 7f8114ab4000 prefer:1 anon=102400 dirty=102400 active=51892 N0=3072 N1=99328 kernelpagesize_kB=4 > AnonHugePages: 376832 kB > 7f37a1404000 prefer:1 anon=102400 dirty=102400 active=55204 N0=23153 N1=79247 kernelpagesize_kB=4 > AnonHugePages: 372736 kB > 7f4abe4af000 prefer:1 anon=102400 dirty=102400 active=52399 N0=23646 N1=78754 kernelpagesize_kB=4 > > The THP utilization varies again and the locality is higher in average > 76+%. Which is even higher than the base page case. I was really > wondering what is the reason for that So I've added a simple debugging > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 8caab1f81a52..565f667f6dcb 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -2140,9 +2140,11 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, > * to prefer hugepage backing, retry allowing remote > * memory as well. > */ > - if (!page && (gfp & __GFP_DIRECT_RECLAIM)) > + if (!page && (gfp & __GFP_DIRECT_RECLAIM)) { > page = __alloc_pages_node(hpage_node, > gfp | __GFP_NORETRY, order); > + printk("XXX: %s\n", !page ? 
"fail" : hpage_node == page_to_nid(page)?"match":"fallback"); > + } > > goto out; > } > > Cases like > AnonHugePages: 407552 kB > 7f3ab2ebf000 prefer:1 anon=102400 dirty=102400 active=77503 N0=43520 N1=58880 kernelpagesize_kB=4 > 85 XXX: fallback > a very successful ones on the other hand > AnonHugePages: 280576 kB > 7feffd9c2000 prefer:1 anon=102400 dirty=102400 active=52563 N0=3131 N1=99269 kernelpagesize_kB=4 > 62 XXX: fail > 3 XXX: fallback > > Note that 62 failing THPs is 31744 pages which also explains much higher > locality I suspect (we simply allocate small pages from the preferred > node). > > This is just a very trivial test and I still have to grasp the outcome. > My current feeling is that the whole thing is very timing specific and > the higher local utilization depends on somebody else doing a reclaim > on our behalf with patches applied. > > 5.3 without any patches behaves more stable although there is a slightly > higher off node THP success rate but it also seems that the same > overall THP success rate can be achieved from the local node as well so > performing compaction would make more sense for those. > > With the simple hack on top of 5.3 > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 9c9194959271..414d0eaa6287 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4685,7 +4685,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, > { > struct page *page; > unsigned int alloc_flags = ALLOC_WMARK_LOW; > - gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */ > + gfp_t alloc_mask, first_alloc_mask; /* The gfp_t that was actually used for allocation */ > struct alloc_context ac = { }; > > /* > @@ -4711,7 +4711,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid, > alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp_mask); > > /* First allocation attempt */ > - page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac); > + first_alloc_mask = alloc_mask; > + if (order && (alloc_mask & __GFP_DIRECT_RECLAIM)) > + first_alloc_mask |= __GFP_THISNODE; > + page = get_page_from_freelist(first_alloc_mask, order, alloc_flags, &ac); > if (likely(page)) > goto out; > > I am getting > AnonHugePages: 407552 kB > 7f88a67ca000 prefer:1 anon=102400 dirty=102400 active=60362 N0=39424 N1=62976 kernelpagesize_kB=4 > > AnonHugePages: 407552 kB > 7f18f0de5000 prefer:1 anon=102400 dirty=102400 active=51685 N0=34720 N1=67680 kernelpagesize_kB=4 > > AnonHugePages: 407552 kB > 7f443f89e000 prefer:1 anon=102400 dirty=102400 active=52382 N0=62976 N1=39424 kernelpagesize_kB=4 > > While first two rounds looked very promising with 60%+ locality with a > consistent THP utilization then the locality went crazy so there is > more to look into. The fast path still seems to make some difference > though.
On 10/18/19 4:15 PM, Michal Hocko wrote: > It's been some time since I've posted these results. The hugetlb issue > got resolved but I would still like to hear back about these findings > because they suggest that the current bail out strategy doesn't seem > to produce very good results. Essentially it doesn't really help THP > locality (on moderately filled up nodes) and it introduces a strong > dependency on kswapd which is not a source of the high order pages. > Also the overral THP success rate is decreased on a pretty standard "RAM > is used for page cache" workload. > > That makes me think that the only possible workload that might really > benefit from this heuristic is a THP demanding one on a heavily > fragmented node with a lot of free memory while other nodes are not > fragmented and have quite a lot of free memory. If that is the case, is > this something to optimize for? > > I am keeping all the results for the reference in a condensed form > > On Tue 01-10-19 10:37:43, Michal Hocko wrote: >> I have split out my kvm machine into two nodes to get at least some >> idea how these patches behave >> $ numactl -H >> available: 2 nodes (0-1) >> node 0 cpus: 0 2 >> node 0 size: 475 MB >> node 0 free: 432 MB >> node 1 cpus: 1 3 >> node 1 size: 503 MB >> node 1 free: 458 MB >> >> First run with 5.3 and without THP >> $ echo never > /sys/kernel/mm/transparent_hugepage/enabled >> root@test1:~# sh thp_test.sh >> 7f4bdefec000 prefer:1 anon=102400 dirty=102400 active=86115 N0=41963 N1=60437 kernelpagesize_kB=4 >> 7fd0f248b000 prefer:1 anon=102400 dirty=102400 active=86909 N0=40079 N1=62321 kernelpagesize_kB=4 >> 7f2a69fc3000 prefer:1 anon=102400 dirty=102400 active=85244 N0=44455 N1=57945 kernelpagesize_kB=4 >> >> So we get around 56-60% pages to the preferred node >> >> Now let's enable THPs >> AnonHugePages: 407552 kB >> 7f05c6dee000 prefer:1 anon=102400 dirty=102400 active=52718 N0=50688 N1=51712 kernelpagesize_kB=4 >> Few more runs >> AnonHugePages: 407552 kB >> 7effca1b9000 prefer:1 anon=102400 dirty=102400 active=65977 N0=53760 N1=48640 kernelpagesize_kB=4 >> AnonHugePages: 407552 kB >> 7f474bfc4000 prefer:1 anon=102400 dirty=102400 active=52676 N0=8704 N1=93696 kernelpagesize_kB=4 >> >> The utilization is again almost 100% and the preferred node usage >> varied a lot between 47-91%. >> >> Now with 5.3 + all 4 patches this time: >> AnonHugePages: 401408 kB >> 7f8114ab4000 prefer:1 anon=102400 dirty=102400 active=51892 N0=3072 N1=99328 kernelpagesize_kB=4 >> AnonHugePages: 376832 kB >> 7f37a1404000 prefer:1 anon=102400 dirty=102400 active=55204 N0=23153 N1=79247 kernelpagesize_kB=4 >> AnonHugePages: 372736 kB >> 7f4abe4af000 prefer:1 anon=102400 dirty=102400 active=52399 N0=23646 N1=78754 kernelpagesize_kB=4 >> >> The THP utilization varies again and the locality is higher in average >> 76+%. Which is even higher than the base page case. I was really I tried to reproduce your setup locally, and got this for THP case on 5.4-rc4: AnonHugePages: 395264 kB 7fdc4a2c0000 prefer:1 anon=102400 dirty=102400 N0=48852 N1=53548 kernelpagesize_kB=4 AnonHugePages: 401408 kB 7f27167e2000 prefer:1 anon=102400 dirty=102400 N0=40095 N1=62305 kernelpagesize_kB=4 AnonHugePages: 378880 kB 7ff693ff9000 prefer:1 anon=102400 dirty=102400 N0=58061 N1=44339 kernelpagesize_kB=4 Somewhat better THP utilization and worse node locality than you. 
Then I applied a rebased patch that I proposed before (see below): AnonHugePages: 407552 kB 7f33fa83a000 prefer:1 anon=102400 dirty=102400 N0=28672 N1=73728 kernelpagesize_kB=4 AnonHugePages: 407552 kB 7faac0aa9000 prefer:1 anon=102400 dirty=102400 N0=48869 N1=53531 kernelpagesize_kB=4 AnonHugePages: 407552 kB 7f9f32c57000 prefer:1 anon=102400 dirty=102400 N0=49664 N1=52736 kernelpagesize_kB=4 The THP utilization is now back at 100% as 5.3 (modulo mis-alignment of the mem_eater area). This is expected, as the second try that's not limited to __GFP_THISNODE is also not limited by the newly introduced (in 5.4) heuristics that checks COMPACT_SKIPPED. Locality seems similar, can't make any conclusions with such variation and so few tries. Could you try confirming that as well? Thanks. But I agree the test is limited and probably depends on timing wrt kswapd making progress. ----8<---- From 8bd960e4e8e7e99fe13baf0d00b61910b3ae8d23 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka <vbabka@suse.cz> Date: Tue, 1 Oct 2019 14:20:58 +0200 Subject: [PATCH] mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations THP page faults now attempt a __GFP_THISNODE allocation first, which should only compact existing free memory, followed by another attempt that can allocate from any node using reclaim/compaction effort specified by global defrag setting and madvise. This patch makes the following changes to the scheme: - before the patch, the first allocation relies on a check for pageblock order and __GFP_IO to prevent excessive reclaim. This however affects also the second attempt, which is not limited to single node. Instead of that, reuse the existing check for costly order __GFP_NORETRY allocations, and make sure the first THP attempt uses __GFP_NORETRY. As a side-effect, all costly order __GFP_NORETRY allocations will bail out if compaction needs reclaim, while previously they only bailed out when compaction was deferred due to previous failures. This should be still acceptable within the __GFP_NORETRY semantics. - before the patch, the second allocation attempt (on all nodes) was passing __GFP_NORETRY. This is redundant as the check for pageblock order (discussed above) was stronger. It's also contrary to madvise(MADV_HUGEPAGE) which means some effort to allocate THP is requested. After this patch, the second attempt doesn't pass __GFP_THISNODE nor __GFP_NORETRY. To sum up, THP page faults now try the following attempt: 1. local node only THP allocation with no reclaim, just compaction. 2. THP allocation from any node with effort determined by global defrag setting and VMA madvise 3. fallback to base pages on any node Signed-off-by: Vlastimil Babka <vbabka@suse.cz> --- mm/mempolicy.c | 16 +++++++++------- mm/page_alloc.c | 24 +++++------------------- 2 files changed, 14 insertions(+), 26 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 4ae967bcf954..2c48146f3ee2 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2129,18 +2129,20 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, nmask = policy_nodemask(gfp, pol); if (!nmask || node_isset(hpage_node, *nmask)) { mpol_cond_put(pol); + /* + * First, try to allocate THP only on local node, but + * don't reclaim unnecessarily, just compact. 
+ */ page = __alloc_pages_node(hpage_node, - gfp | __GFP_THISNODE, order); + gfp | __GFP_THISNODE | __GFP_NORETRY, order); /* - * If hugepage allocations are configured to always - * synchronous compact or the vma has been madvised - * to prefer hugepage backing, retry allowing remote - * memory as well. + * If that fails, allow both compaction and reclaim, + * but on all nodes. */ - if (!page && (gfp & __GFP_DIRECT_RECLAIM)) + if (!page) page = __alloc_pages_node(hpage_node, - gfp | __GFP_NORETRY, order); + gfp, order); goto out; } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ecc3dbad606b..36d7d852f7b1 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4473,8 +4473,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, if (page) goto got_pg; - if (order >= pageblock_order && (gfp_mask & __GFP_IO) && - !(gfp_mask & __GFP_RETRY_MAYFAIL)) { + /* + * Checks for costly allocations with __GFP_NORETRY, which + * includes some THP page fault allocations + */ + if (costly_order && (gfp_mask & __GFP_NORETRY)) { /* * If allocating entire pageblock(s) and compaction * failed because all zones are below low watermarks @@ -4495,23 +4498,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, if (compact_result == COMPACT_SKIPPED || compact_result == COMPACT_DEFERRED) goto nopage; - } - - /* - * Checks for costly allocations with __GFP_NORETRY, which - * includes THP page fault allocations - */ - if (costly_order && (gfp_mask & __GFP_NORETRY)) { - /* - * If compaction is deferred for high-order allocations, - * it is because sync compaction recently failed. If - * this is the case and the caller requested a THP - * allocation, we do not want to heavily disrupt the - * system, so we fail the allocation instead of entering - * direct reclaim. - */ - if (compact_result == COMPACT_DEFERRED) - goto nopage; /* * Looks like reclaim/compaction is worth trying, but
On Wed, 23 Oct 2019, Vlastimil Babka wrote: > From 8bd960e4e8e7e99fe13baf0d00b61910b3ae8d23 Mon Sep 17 00:00:00 2001 > From: Vlastimil Babka <vbabka@suse.cz> > Date: Tue, 1 Oct 2019 14:20:58 +0200 > Subject: [PATCH] mm, thp: tweak reclaim/compaction effort of local-only and > all-node allocations > > THP page faults now attempt a __GFP_THISNODE allocation first, which should > only compact existing free memory, followed by another attempt that can > allocate from any node using reclaim/compaction effort specified by global > defrag setting and madvise. > > This patch makes the following changes to the scheme: > > - before the patch, the first allocation relies on a check for pageblock order > and __GFP_IO to prevent excessive reclaim. This however affects also the > second attempt, which is not limited to single node. Instead of that, reuse > the existing check for costly order __GFP_NORETRY allocations, and make sure > the first THP attempt uses __GFP_NORETRY. As a side-effect, all costly order > __GFP_NORETRY allocations will bail out if compaction needs reclaim, while > previously they only bailed out when compaction was deferred due to previous > failures. This should be still acceptable within the __GFP_NORETRY semantics. > > - before the patch, the second allocation attempt (on all nodes) was passing > __GFP_NORETRY. This is redundant as the check for pageblock order (discussed > above) was stronger. It's also contrary to madvise(MADV_HUGEPAGE) which means > some effort to allocate THP is requested. After this patch, the second > attempt doesn't pass __GFP_THISNODE nor __GFP_NORETRY. > > To sum up, THP page faults now try the following attempt: > > 1. local node only THP allocation with no reclaim, just compaction. > 2. THP allocation from any node with effort determined by global defrag setting > and VMA madvise > 3. fallback to base pages on any node > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> > --- > mm/mempolicy.c | 16 +++++++++------- > mm/page_alloc.c | 24 +++++------------------- > 2 files changed, 14 insertions(+), 26 deletions(-) > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 4ae967bcf954..2c48146f3ee2 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -2129,18 +2129,20 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, > nmask = policy_nodemask(gfp, pol); > if (!nmask || node_isset(hpage_node, *nmask)) { > mpol_cond_put(pol); > + /* > + * First, try to allocate THP only on local node, but > + * don't reclaim unnecessarily, just compact. > + */ > page = __alloc_pages_node(hpage_node, > - gfp | __GFP_THISNODE, order); > + gfp | __GFP_THISNODE | __GFP_NORETRY, order); > > /* > - * If hugepage allocations are configured to always > - * synchronous compact or the vma has been madvised > - * to prefer hugepage backing, retry allowing remote > - * memory as well. > + * If that fails, allow both compaction and reclaim, > + * but on all nodes. > */ > - if (!page && (gfp & __GFP_DIRECT_RECLAIM)) > + if (!page) > page = __alloc_pages_node(hpage_node, > - gfp | __GFP_NORETRY, order); > + gfp, order); > > goto out; > } Hi Vlastimil, For the default case where thp enabled is not set to "always" and the VMA is not madvised for MADV_HUGEPAGE, how does this prefer to return node local pages rather than remote hugepages? The idea is to optimize for access latency when the vma has not been explicitly madvised. 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index ecc3dbad606b..36d7d852f7b1 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4473,8 +4473,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > if (page) > goto got_pg; > > - if (order >= pageblock_order && (gfp_mask & __GFP_IO) && > - !(gfp_mask & __GFP_RETRY_MAYFAIL)) { > + /* > + * Checks for costly allocations with __GFP_NORETRY, which > + * includes some THP page fault allocations > + */ > + if (costly_order && (gfp_mask & __GFP_NORETRY)) { > /* > * If allocating entire pageblock(s) and compaction > * failed because all zones are below low watermarks > @@ -4495,23 +4498,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > if (compact_result == COMPACT_SKIPPED || > compact_result == COMPACT_DEFERRED) > goto nopage; > - } > - > - /* > - * Checks for costly allocations with __GFP_NORETRY, which > - * includes THP page fault allocations > - */ > - if (costly_order && (gfp_mask & __GFP_NORETRY)) { > - /* > - * If compaction is deferred for high-order allocations, > - * it is because sync compaction recently failed. If > - * this is the case and the caller requested a THP > - * allocation, we do not want to heavily disrupt the > - * system, so we fail the allocation instead of entering > - * direct reclaim. > - */ > - if (compact_result == COMPACT_DEFERRED) > - goto nopage; > > /* > * Looks like reclaim/compaction is worth trying, but > -- > 2.23.0 > >
On 10/24/19 8:59 PM, David Rientjes wrote: >> diff --git a/mm/mempolicy.c b/mm/mempolicy.c >> index 4ae967bcf954..2c48146f3ee2 100644 >> --- a/mm/mempolicy.c >> +++ b/mm/mempolicy.c >> @@ -2129,18 +2129,20 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, >> nmask = policy_nodemask(gfp, pol); >> if (!nmask || node_isset(hpage_node, *nmask)) { >> mpol_cond_put(pol); >> + /* >> + * First, try to allocate THP only on local node, but >> + * don't reclaim unnecessarily, just compact. >> + */ >> page = __alloc_pages_node(hpage_node, >> - gfp | __GFP_THISNODE, order); >> + gfp | __GFP_THISNODE | __GFP_NORETRY, order); >> >> /* >> - * If hugepage allocations are configured to always >> - * synchronous compact or the vma has been madvised >> - * to prefer hugepage backing, retry allowing remote >> - * memory as well. >> + * If that fails, allow both compaction and reclaim, >> + * but on all nodes. >> */ >> - if (!page && (gfp & __GFP_DIRECT_RECLAIM)) >> + if (!page) >> page = __alloc_pages_node(hpage_node, >> - gfp | __GFP_NORETRY, order); >> + gfp, order); >> >> goto out; >> } > Hi Vlastimil, > > For the default case where thp enabled is not set to "always" and the VMA I assume you meant "defrag" instead of "enabled". > is not madvised for MADV_HUGEPAGE, how does this prefer to return node > local pages rather than remote hugepages? The idea is to optimize for > access latency when the vma has not been explicitly madvised. Right, you mentioned this before IIRC, and I forgot. How about this? We could be also smarter and consolidate to a single attempt if there's actually just a single NUMA node, but that can be optimized later. ----8<---- From 4c3a2217d0ee5ead00b1443010d07c664b6ac645 Mon Sep 17 00:00:00 2001 From: Vlastimil Babka <vbabka@suse.cz> Date: Tue, 1 Oct 2019 14:20:58 +0200 Subject: [PATCH] mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations THP page faults now attempt a __GFP_THISNODE allocation first, which should only compact existing free memory, followed by another attempt that can allocate from any node using reclaim/compaction effort specified by global defrag setting and madvise. This patch makes the following changes to the scheme: - before the patch, the first allocation relies on a check for pageblock order and __GFP_IO to prevent excessive reclaim. This however affects also the second attempt, which is not limited to single node. Instead of that, reuse the existing check for costly order __GFP_NORETRY allocations, and make sure the first THP attempt uses __GFP_NORETRY. As a side-effect, all costly order __GFP_NORETRY allocations will bail out if compaction needs reclaim, while previously they only bailed out when compaction was deferred due to previous failures. This should be still acceptable within the __GFP_NORETRY semantics. - before the patch, the second allocation attempt (on all nodes) was passing __GFP_NORETRY. This is redundant as the check for pageblock order (discussed above) was stronger. It's also contrary to madvise(MADV_HUGEPAGE) which means some effort to allocate THP is requested. After this patch, the second attempt doesn't pass __GFP_THISNODE nor __GFP_NORETRY. To sum up, THP page faults now try the following attempts: 1. local node only THP allocation with no reclaim, just compaction. 2. for madvised VMA's or when synchronous compaction is enabled always - THP allocation from any node with effort determined by global defrag setting and VMA madvise 3. 
fallback to base pages on any node Signed-off-by: Vlastimil Babka <vbabka@suse.cz> --- mm/mempolicy.c | 10 +++++++--- mm/page_alloc.c | 24 +++++------------------- 2 files changed, 12 insertions(+), 22 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 4ae967bcf954..ed6fbc5b1e20 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2129,18 +2129,22 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, nmask = policy_nodemask(gfp, pol); if (!nmask || node_isset(hpage_node, *nmask)) { mpol_cond_put(pol); + /* + * First, try to allocate THP only on local node, but + * don't reclaim unnecessarily, just compact. + */ page = __alloc_pages_node(hpage_node, - gfp | __GFP_THISNODE, order); + gfp | __GFP_THISNODE | __GFP_NORETRY, order); /* * If hugepage allocations are configured to always * synchronous compact or the vma has been madvised * to prefer hugepage backing, retry allowing remote - * memory as well. + * memory with both reclaim and compact as well. */ if (!page && (gfp & __GFP_DIRECT_RECLAIM)) page = __alloc_pages_node(hpage_node, - gfp | __GFP_NORETRY, order); + gfp, order); goto out; } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ecc3dbad606b..36d7d852f7b1 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4473,8 +4473,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, if (page) goto got_pg; - if (order >= pageblock_order && (gfp_mask & __GFP_IO) && - !(gfp_mask & __GFP_RETRY_MAYFAIL)) { + /* + * Checks for costly allocations with __GFP_NORETRY, which + * includes some THP page fault allocations + */ + if (costly_order && (gfp_mask & __GFP_NORETRY)) { /* * If allocating entire pageblock(s) and compaction * failed because all zones are below low watermarks @@ -4495,23 +4498,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, if (compact_result == COMPACT_SKIPPED || compact_result == COMPACT_DEFERRED) goto nopage; - } - - /* - * Checks for costly allocations with __GFP_NORETRY, which - * includes THP page fault allocations - */ - if (costly_order && (gfp_mask & __GFP_NORETRY)) { - /* - * If compaction is deferred for high-order allocations, - * it is because sync compaction recently failed. If - * this is the case and the caller requested a THP - * allocation, we do not want to heavily disrupt the - * system, so we fail the allocation instead of entering - * direct reclaim. - */ - if (compact_result == COMPACT_DEFERRED) - goto nopage; /* * Looks like reclaim/compaction is worth trying, but
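To spell out why keeping the "gfp & __GFP_DIRECT_RECLAIM" check preserves the non-madvised behaviour David is asking about: whether the second, all-node THP attempt runs depends on the gfp mask the fault path computed earlier. A simplified, hedged sketch follows (an approximation of what alloc_hugepage_direct_gfpmask() produces with the common defrag settings; the real function also handles the defer modes):

/* Hedged simplification, not verbatim kernel code. */
static gfp_t thp_fault_gfp(bool defrag_always, bool vma_madvised)
{
	if (defrag_always || vma_madvised)
		return GFP_TRANSHUGE;		/* includes __GFP_DIRECT_RECLAIM */
	return GFP_TRANSHUGE_LIGHT;		/* no direct reclaim bit */
}

/*
 * With the hunk above, only a mask that carries __GFP_DIRECT_RECLAIM reaches
 * the second, all-node THP attempt; the default non-madvised case therefore
 * still falls back to local base pages rather than remote hugepages.
 */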
On Tue 29-10-19 15:14:59, Vlastimil Babka wrote: > >From 4c3a2217d0ee5ead00b1443010d07c664b6ac645 Mon Sep 17 00:00:00 2001 > From: Vlastimil Babka <vbabka@suse.cz> > Date: Tue, 1 Oct 2019 14:20:58 +0200 > Subject: [PATCH] mm, thp: tweak reclaim/compaction effort of local-only and > all-node allocations > > THP page faults now attempt a __GFP_THISNODE allocation first, which should > only compact existing free memory, followed by another attempt that can > allocate from any node using reclaim/compaction effort specified by global > defrag setting and madvise. > > This patch makes the following changes to the scheme: > > - before the patch, the first allocation relies on a check for pageblock order > and __GFP_IO to prevent excessive reclaim. This however affects also the > second attempt, which is not limited to single node. Instead of that, reuse > the existing check for costly order __GFP_NORETRY allocations, and make sure > the first THP attempt uses __GFP_NORETRY. As a side-effect, all costly order > __GFP_NORETRY allocations will bail out if compaction needs reclaim, while > previously they only bailed out when compaction was deferred due to previous > failures. This should be still acceptable within the __GFP_NORETRY semantics. > > - before the patch, the second allocation attempt (on all nodes) was passing > __GFP_NORETRY. This is redundant as the check for pageblock order (discussed > above) was stronger. It's also contrary to madvise(MADV_HUGEPAGE) which means > some effort to allocate THP is requested. After this patch, the second > attempt doesn't pass __GFP_THISNODE nor __GFP_NORETRY. > > To sum up, THP page faults now try the following attempts: > > 1. local node only THP allocation with no reclaim, just compaction. > 2. for madvised VMA's or when synchronous compaction is enabled always - THP > allocation from any node with effort determined by global defrag setting > and VMA madvise > 3. fallback to base pages on any node > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> I've given this a try and here are the results of my previous testcase (memory full of page cache). root@test1:~# sh thp_test.sh + echo 3 + echo 1 + dd if=/mnt/data/file-1G of=/dev/null bs=4096 262144+0 records in 262144+0 records out 1073741824 bytes (1.1 GB) copied, 18.5235 s, 58.0 MB/s + date +%s + TS=1572361069 + cp /proc/vmstat vmstat.1572361069.before + numactl --preferred 1 ./mem_eater nowait 400M 7f959336a000-7f95ac36a000 rw-p 00000000 00:00 0 Size: 409600 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Rss: 409600 kB Pss: 409600 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 409600 kB Referenced: 208296 kB Anonymous: 409600 kB LazyFree: 0 kB AnonHugePages: 407552 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB Locked: 0 kB THPeligible: 1 7f959336a000 prefer:1 anon=102400 dirty=102400 active=52074 N0=32768 N1=69632 kernelpagesize_kB=4 + cp /proc/vmstat vmstat.1572361069.after AnonHugePages: 407552 kB 7f590e2b3000 prefer:1 anon=102400 dirty=102400 active=61619 N0=45235 N1=57165 kernelpagesize_kB=4 AnonHugePages: 407552 kB 7f8667e9a000 prefer:1 anon=102400 dirty=102400 active=52378 N0=6144 N1=96256 kernelpagesize_kB=4 So the THP utilization is back to 100% (modulo alignment) and the node locality is 55%-94% - with an obviously high variance based on the kswapd activity in line with the kswapd based feedback backoff. All that with a high THP utilization which sounds like an approvement. 
The code is definitely less hackish so I would go with it for the time being. I still believe that we have a more fundamental problem to deal with though. All these open coded fallback allocations just point out that our allocator doesn't cooperate with he kcompactd properly and that should be fixed longterm. Acked-by: Michal Hocko <mhocko@suse.com> > --- > mm/mempolicy.c | 10 +++++++--- > mm/page_alloc.c | 24 +++++------------------- > 2 files changed, 12 insertions(+), 22 deletions(-) > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 4ae967bcf954..ed6fbc5b1e20 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -2129,18 +2129,22 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, > nmask = policy_nodemask(gfp, pol); > if (!nmask || node_isset(hpage_node, *nmask)) { > mpol_cond_put(pol); > + /* > + * First, try to allocate THP only on local node, but > + * don't reclaim unnecessarily, just compact. > + */ > page = __alloc_pages_node(hpage_node, > - gfp | __GFP_THISNODE, order); > + gfp | __GFP_THISNODE | __GFP_NORETRY, order); > > /* > * If hugepage allocations are configured to always > * synchronous compact or the vma has been madvised > * to prefer hugepage backing, retry allowing remote > - * memory as well. > + * memory with both reclaim and compact as well. > */ > if (!page && (gfp & __GFP_DIRECT_RECLAIM)) > page = __alloc_pages_node(hpage_node, > - gfp | __GFP_NORETRY, order); > + gfp, order); > > goto out; > } > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index ecc3dbad606b..36d7d852f7b1 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4473,8 +4473,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > if (page) > goto got_pg; > > - if (order >= pageblock_order && (gfp_mask & __GFP_IO) && > - !(gfp_mask & __GFP_RETRY_MAYFAIL)) { > + /* > + * Checks for costly allocations with __GFP_NORETRY, which > + * includes some THP page fault allocations > + */ > + if (costly_order && (gfp_mask & __GFP_NORETRY)) { > /* > * If allocating entire pageblock(s) and compaction > * failed because all zones are below low watermarks > @@ -4495,23 +4498,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > if (compact_result == COMPACT_SKIPPED || > compact_result == COMPACT_DEFERRED) > goto nopage; > - } > - > - /* > - * Checks for costly allocations with __GFP_NORETRY, which > - * includes THP page fault allocations > - */ > - if (costly_order && (gfp_mask & __GFP_NORETRY)) { > - /* > - * If compaction is deferred for high-order allocations, > - * it is because sync compaction recently failed. If > - * this is the case and the caller requested a THP > - * allocation, we do not want to heavily disrupt the > - * system, so we fail the allocation instead of entering > - * direct reclaim. > - */ > - if (compact_result == COMPACT_DEFERRED) > - goto nopage; > > /* > * Looks like reclaim/compaction is worth trying, but > -- > 2.23.0
On Tue, 29 Oct 2019 16:15:49 +0100 Michal Hocko <mhocko@kernel.org> wrote: > > > > 1. local node only THP allocation with no reclaim, just compaction. > > 2. for madvised VMA's or when synchronous compaction is enabled always - THP > > allocation from any node with effort determined by global defrag setting > > and VMA madvise > > 3. fallback to base pages on any node > > > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> > > I've given this a try and here are the results of my previous testcase > (memory full of page cache). Thanks, I'll queue this for some more testing. At some point we should decide on a suitable set of Fixes: tags and a backporting strategy, if any?
On 10/29/19 10:33 PM, Andrew Morton wrote: > On Tue, 29 Oct 2019 16:15:49 +0100 Michal Hocko <mhocko@kernel.org> wrote: > >>> >>> 1. local node only THP allocation with no reclaim, just compaction. >>> 2. for madvised VMA's or when synchronous compaction is enabled always - THP >>> allocation from any node with effort determined by global defrag setting >>> and VMA madvise >>> 3. fallback to base pages on any node >>> >>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> >> >> I've given this a try and here are the results of my previous testcase >> (memory full of page cache). > > Thanks, I'll queue this for some more testing. At some point we should > decide on a suitable set of Fixes: tags and a backporting strategy, if any? I guess the tag below, same as 3f36d8669457 had, since this patch is a similar kind of refinement. Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed") If accepted, it would be nice if it made it to 5.4, or at least 5.4.y (it will be an LTSS AFAIK). Not sure if we should backport all 5.4 changes to 5.3.y as that will be EOL rather soon after 5.4 final? I guess that's for David. And for going to older LTSS's we would at least need more long-term confidence and tests on real workloads - Andrea?
On Tue, 29 Oct 2019, Andrew Morton wrote: > On Tue, 29 Oct 2019 16:15:49 +0100 Michal Hocko <mhocko@kernel.org> wrote: > > > > > > > 1. local node only THP allocation with no reclaim, just compaction. > > > 2. for madvised VMA's or when synchronous compaction is enabled always - THP > > > allocation from any node with effort determined by global defrag setting > > > and VMA madvise > > > 3. fallback to base pages on any node > > > > > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> > > > > I've given this a try and here are the results of my previous testcase > > (memory full of page cache). > > Thanks, I'll queue this for some more testing. At some point we should > decide on a suitable set of Fixes: tags and a backporting strategy, if any? > I'd strongly suggest that Andrea test this patch out on his workload on hosts where all nodes are low on memory because based on my understanding of his reported issue this would result in swap storms reemerging but worse this time because they wouldn't be constrained only locally. (This patch causes us to no longer circumvent excessive reclaim when using MADV_HUGEPAGE.)
On Tue 29-10-19 16:25:17, David Rientjes wrote: > On Tue, 29 Oct 2019, Andrew Morton wrote: > > > On Tue, 29 Oct 2019 16:15:49 +0100 Michal Hocko <mhocko@kernel.org> wrote: > > > > > > > > > > 1. local node only THP allocation with no reclaim, just compaction. > > > > 2. for madvised VMA's or when synchronous compaction is enabled always - THP > > > > allocation from any node with effort determined by global defrag setting > > > > and VMA madvise > > > > 3. fallback to base pages on any node > > > > > > > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> > > > > > > I've given this a try and here are the results of my previous testcase > > > (memory full of page cache). > > > > Thanks, I'll queue this for some more testing. At some point we should > > decide on a suitable set of Fixes: tags and a backporting strategy, if any? > > > > I'd strongly suggest that Andrea test this patch out on his workload on > hosts where all nodes are low on memory because based on my understanding > of his reported issue this would result in swap storms reemerging but > worse this time because they wouldn't be constrained only locally. (This > patch causes us to no longer circumvent excessive reclaim when using > MADV_HUGEPAGE.) Could you be more specific on why this would be the case? My testing doesn't show any such signs and I am effectively testing a low-memory situation. The amount of reclaimed memory matches the amount of requested memory.
On Tue, 5 Nov 2019, Michal Hocko wrote: > > > Thanks, I'll queue this for some more testing. At some point we should > > > decide on a suitable set of Fixes: tags and a backporting strategy, if any? > > > > > > > I'd strongly suggest that Andrea test this patch out on his workload on > > hosts where all nodes are low on memory because based on my understanding > > of his reported issue this would result in swap storms reemerging but > > worse this time because they wouldn't be constrained only locally. (This > > patch causes us to no longer circumvent excessive reclaim when using > > MADV_HUGEPAGE.) > > Could you be more specific on why this would be the case? My testing is > doesn't show any such signs and I am effectivelly testing memory low > situation. The amount of reclaimed memory matches the amount of > requested memory. > The follow-up allocation in alloc_pages_vma() would no longer use __GFP_NORETRY and there is no special handling to avoid swap storms in the page allocator anymore as a result of this patch. I don't see any indication that this allocation would behave any different than the code that Andrea experienced swap storms with, but now worse if remote memory is in the same state local memory is when he's using __GFP_THISNODE.
On Tue 05-11-19 17:01:00, David Rientjes wrote: > On Tue, 5 Nov 2019, Michal Hocko wrote: > > > > > Thanks, I'll queue this for some more testing. At some point we should > > > > decide on a suitable set of Fixes: tags and a backporting strategy, if any? > > > > > > > > > > I'd strongly suggest that Andrea test this patch out on his workload on > > > hosts where all nodes are low on memory because based on my understanding > > > of his reported issue this would result in swap storms reemerging but > > > worse this time because they wouldn't be constrained only locally. (This > > > patch causes us to no longer circumvent excessive reclaim when using > > > MADV_HUGEPAGE.) > > > > Could you be more specific on why this would be the case? My testing is > > doesn't show any such signs and I am effectivelly testing memory low > > situation. The amount of reclaimed memory matches the amount of > > requested memory. > > > > The follow-up allocation in alloc_pages_vma() would no longer use > __GFP_NORETRY and there is no special handling to avoid swap storms in the > page allocator anymore as a result of this patch. Yes there is no __GFP_NORETRY in the fallback path because the control over how hard to retry is controlled by alloc_hugepage_direct_gfpmask depending on the defrag mode and madvise mode. > I don't see any > indication that this allocation would behave any different than the code > that Andrea experienced swap storms with, but now worse if remote memory > is in the same state local memory is when he's using __GFP_THISNODE. The primary reason for the extensive swapping was exactly the __GFP_THISNODE in conjunction with an unbounded direct reclaim AFAIR. The whole point of the Vlastimil's patch is to have an optimistic local node allocation first and the full gfp context one in the fallback path. If our full gfp context doesn't really work well then we can revisit that of course but that should happen at alloc_hugepage_direct_gfpmask level.
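For reference, alloc_hugepage_direct_gfpmask() in kernels of that era mapped the defrag sysfs setting and MADV_HUGEPAGE onto gfp flags roughly as below. This is a simplified, from-memory paraphrase rather than a verbatim copy, so treat the exact flag combinations as illustrative.

	/*
	 * Simplified paraphrase of alloc_hugepage_direct_gfpmask() around
	 * v5.3/v5.4 -- illustrative only, not the verbatim function.
	 */
	static gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
	{
		const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);

		/* defrag=always: synchronous compaction for everybody */
		if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
			     &transparent_hugepage_flags))
			return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);

		/* defrag=defer: only kick kcompactd, never reclaim directly */
		if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
			     &transparent_hugepage_flags))
			return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;

		/* defrag=defer+madvise: direct reclaim only for madvised VMAs */
		if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG,
			     &transparent_hugepage_flags))
			return GFP_TRANSHUGE_LIGHT | (vma_madvised ?
				__GFP_DIRECT_RECLAIM : __GFP_KSWAPD_RECLAIM);

		/* defrag=madvise: direct reclaim only when explicitly asked for */
		if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
			     &transparent_hugepage_flags))
			return GFP_TRANSHUGE_LIGHT |
				(vma_madvised ? __GFP_DIRECT_RECLAIM : 0);

		/* defrag=never */
		return GFP_TRANSHUGE_LIGHT;
	}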
On Wed, 6 Nov 2019, Michal Hocko wrote: > > I don't see any > > indication that this allocation would behave any different than the code > > that Andrea experienced swap storms with, but now worse if remote memory > > is in the same state local memory is when he's using __GFP_THISNODE. > > The primary reason for the extensive swapping was exactly the __GFP_THISNODE > in conjunction with an unbounded direct reclaim AFAIR. > > The whole point of the Vlastimil's patch is to have an optimistic local > node allocation first and the full gfp context one in the fallback path. > If our full gfp context doesn't really work well then we can revisit > that of course but that should happen at alloc_hugepage_direct_gfpmask > level. Since the patch reverts the precaution put into the page allocator to not attempt reclaim if the allocation order is significantly large and the return value from compaction specifies it is unlikely to succeed on its own, I believe Vlastimil's patch will cause the same regression that Andrea saw if the whole host is low on memory and/or significantly fragmented. So the suggestion was that he test this change to make sure we aren't introducing a regression for his workload.
On Wed, Nov 06, 2019 at 01:32:37PM -0800, David Rientjes wrote: > On Wed, 6 Nov 2019, Michal Hocko wrote: > > > > I don't see any > > > indication that this allocation would behave any different than the code > > > that Andrea experienced swap storms with, but now worse if remote memory > > > is in the same state local memory is when he's using __GFP_THISNODE. > > > > The primary reason for the extensive swapping was exactly the __GFP_THISNODE > > in conjunction with an unbounded direct reclaim AFAIR. > > > > The whole point of the Vlastimil's patch is to have an optimistic local > > node allocation first and the full gfp context one in the fallback path. > > If our full gfp context doesn't really work well then we can revisit > > that of course but that should happen at alloc_hugepage_direct_gfpmask > > level. > > Since the patch reverts the precaution put into the page allocator to not > attempt reclaim if the allocation order is significantly large and the > return value from compaction specifies it is unlikely to succed on its > own, I believe Vlastimil's patch will cause the same regression that > Andrea saw is the whole host is low on memory and/or significantly > fragmented. So the suggestion was that he test this change to make sure > we aren't introducing a regression for his workload. TLDR: I do not have evidence that Vlastimil's patch causes more swapping but more information is needed from Andrea on exactly how he's testing this. It's not clear to me what was originally tested and whether memory just had to be full or whether it had to be fragmented. If fragmented, then we have to agree on what an appropriate mechanism is for fragmenting memory. Hypothetical kernel modules that don't exist do not count. I put together a testcase whereby a virtual machine is deployed, started and then timed how long it takes to run memhog on 80% of the guest's physical memory. I varied how large the virtual machine is and ran it on a 2-socket machine so that the smaller tests would be single node and the larger tests would span both nodes. Before each startup, a large file is read to fill the memory with pagecache. kvmstart 5.2.0 5.3.0 5.4.0-rc6 5.4.0-rc6 vanilla vanilla vanilla thptweak-v1r1 Amean 5 3.43 ( 0.00%) 3.44 ( -0.29%) 3.44 ( -0.19%) 3.39 ( 1.07%) Amean 14 11.18 ( 0.00%) 14.59 ( -30.53%) 14.85 ( -32.80%) 10.30 ( 7.87%) Amean 23 20.89 ( 0.00%) 18.45 ( 11.70%) 24.92 ( -19.29%) 24.88 ( -19.12%) Amean 32 51.69 ( 0.00%) 30.07 ( 41.82%) 30.93 ( 40.16%) 47.97 ( 7.20%) Amean 41 51.44 ( 0.00%) 29.99 * 41.71%* 60.75 ( -18.08%) 77.44 * -50.54%* Amean 50 81.85 ( 0.00%) 60.37 ( 26.25%) 98.09 ( -19.84%) 125.59 * -53.43%* Units are in seconds to run memhog. "Amean 5" is for a 5G virtual machine. Target machine is 2-socket with 64G of RAM so at 32 you'd expect that the machine is larger than a NUMA node. It's set to cut off at a virtual machine 80% the size of physical memory. I used 5.2 as a baseline as 5.3 is where THP allocations changed behaviour which was reverted later and then tweaked. The results show that 5.4.0-rc6 in general is slower to fault all the memory of the virtual machine even before you'd expect the machine to be larger than a NUMA node.
The patch works better for smaller machines and worse for larger machines 5.2.0 5.3.0 5.4.0-rc6 5.4.0-rc6 vanilla vanilla vanilla thptweak-v1r1 Ops Swap Ins 1442665.00 1400209.00 1289549.00 1762479.00 Ops Swap Outs 2291255.00 1689554.00 2222256.00 2885280.00 Ops Kswapd efficiency % 98.74 91.06 75.73 49.75 Ops Kswapd velocity 7875.25 7973.52 8035.11 9129.89 Ops Direct efficiency % 99.54 98.22 94.47 87.80 Ops Direct velocity 5042.27 4982.09 7238.28 9251.50 Ops Percentage direct scans 39.03 38.46 47.39 50.33 Ops Page writes by reclaim 2291779.00 1689555.00 2222771.00 2885334.00 Ops Page writes file 524.00 1.00 515.00 54.00 Ops Page writes anon 2291255.00 1689554.00 2222256.00 2885280.00 Ops Page reclaim immediate 2548.00 76.00 367.00 710.00 Ops Sector Reads 320255172.00 309217632.00 264732588.00 269864268.00 Ops Sector Writes 63264744.00 60776604.00 62860244.00 65572608.00 Ops Page rescued immediate 0.00 0.00 0.00 0.00 Ops Slabs scanned 595876.00 246334.00 1018425.00 1390506.00 Ops Direct inode steals 49.00 5.00 0.00 8.00 Ops Kswapd inode steals 24118244.00 28126116.00 12866766.00 12943742.00 Ops Kswapd skipped wait 0.00 0.00 0.00 0.00 Ops THP fault alloc 164266.00 204790.00 188055.00 190899.00 Ops THP fault fallback 49345.00 8614.00 25454.00 22650.00 Ops THP collapse alloc 132.00 139.00 116.00 121.00 Ops THP collapse fail 4.00 0.00 1.00 2.00 Ops THP split 15789.00 5642.00 18119.00 49169.00 Ops THP split failed 0.00 0.00 0.00 0.00 Ops Compaction stalls 139794.00 52054.00 214361.00 226514.00 Ops Compaction success 19004.00 17786.00 39430.00 35922.00 Ops Compaction failures 120790.00 34268.00 174931.00 190592.00 Ops Compaction efficiency 13.59 34.17 18.39 15.86 Ops Page migrate success 11427696.00 12758554.00 22010443.00 21668849.00 Ops Page migrate failure 11559690.00 10403623.00 13514889.00 13212313.00 Ops Compaction pages isolated 23217363.00 20560760.00 46945743.00 47574686.00 Ops Compaction migrate scanned 19925995.00 16224351.00 49565044.00 58595534.00 Ops Compaction free scanned 68708150.00 47235800.00 134685089.00 153518780.00 Ops Compact scan efficiency 29.00 34.35 36.80 38.17 Ops Compaction cost 12604.40 13893.25 24274.97 23991.33 Ops Kcompactd wake 100.00 172.00 100.00 258.00 Ops Kcompactd migrate scanned 33135.00 55797.00 113344.00 948026.00 Ops Kcompactd free scanned 335005.00 310265.00 174353.00 686434.00 Ops NUMA alloc hit 98173280.00 77063265.00 82393545.00 70699591.00 Ops NUMA alloc miss 22834593.00 20306614.00 12296731.00 24085033.00 Ops NUMA interleave hit 0.00 0.00 0.00 0.00 Ops NUMA alloc local 98163641.00 77059289.00 82384101.00 70697814.00 Ops NUMA base-page range updates 53170070.00 41093095.00 62202747.00 71769257.00 Ops NUMA PTE updates 15041942.00 2726887.00 12972411.00 21418665.00 Ops NUMA PMD updates 74469.00 74934.00 96153.00 98341.00 Ops NUMA hint faults 11492208.00 2337770.00 9146157.00 16587622.00 Ops NUMA hint local faults % 6465386.00 1080902.00 7087393.00 12336942.00 Ops NUMA hint local percent 56.26 46.24 77.49 74.37 Ops NUMA pages migrated 560888.00 3004903.00 336032.00 195839.00 Ops AutoNUMA cost 57843.89 12033.59 46172.59 83444.22 There is a lot here. Both 5.4.0-rc6 and the tweak had more memory placed locally (NUMA hint local percent). *All* kernels swapped pages in an out with 5.4.0-rc6 being as bad as 5.2.0-vanilla and the patch making this slightly but not considerably worse. 
5.3.0 was generally more successful at allocating huge pages (THP fault fallback) with 5.4.0-rc6 being much worse at allocating huge pages and the tweak not making much of a difference. So the patch is a mix of good and bad in this case. However, the test case has notable limitations and more information would be needed from Andrea on exactly how he was evaluating KVM start times. 1. memhog is single threaded which means that one node is likely to be filled first, spillover while the first node reclaims. This means that swapping is somewhat inevitable on NUMA machines with this load. We could run multiple instances but that would have very variable results. 2. Reading a single large file forces reclaim but it definitely is not fragmenting memory. However, we never all agreed on how a machine should be fragmented to be useful for this sort of test. Mentioning hypothetical kernel modules that don't exist is not good enough. Creating massive amounts of small files with fragment memory to some extent but not aggressively. Doing that while memhog is running would help but the results would be very variable. This particular test is hard to reproduce even though it's in mmtests as configs/config-workload-kvmstart-memhog-frag-singlefile because the test relies on KVM helper scripts to deploy, start and stop the virtual machine. These scripts exist outside of mmtests and belong to a set of tools I use to schedule, execute and report on tests across a range of machines. I also ran a THP faulting benchmark with fio running in the background in a configuration that tends to fragment memory (mmtests config workload-thpfioscale-madvhugepage) with ext4 as a backing filesystem for fio. 5.2.0 5.3.0 5.4.0-rc6 5.4.0-rc6 vanilla vanilla vanilla thptweak-v1r1 Min fault-base-5 958.00 ( 0.00%) 1980.00 (-106.68%) 1083.00 ( -13.05%) 1406.00 ( -46.76%) Min fault-huge-5 297.00 ( 0.00%) 309.00 ( -4.04%) 431.00 ( -45.12%) 324.00 ( -9.09%) Min fault-both-5 297.00 ( 0.00%) 309.00 ( -4.04%) 431.00 ( -45.12%) 324.00 ( -9.09%) Amean fault-base-5 2517.81 ( 0.00%) 8886.23 *-252.94%* 4851.98 * -92.71%* 8223.64 *-226.62%* Amean fault-huge-5 2781.67 ( 0.00%) 10397.46 *-273.78%* 3596.91 * -29.31%* 7139.44 *-156.66%* Amean fault-both-5 2662.24 ( 0.00%) 9916.03 *-272.47%* 3990.46 * -49.89%* 7367.48 *-176.74%* Stddev fault-base-5 2713.85 ( 0.00%) 24331.73 (-796.58%) 5530.37 (-103.78%) 27048.99 (-896.70%) Stddev fault-huge-5 3740.46 ( 0.00%) 39529.80 (-956.82%) 5428.68 ( -45.13%) 23418.76 (-526.09%) Stddev fault-both-5 3317.75 ( 0.00%) 35408.44 (-967.24%) 5491.27 ( -65.51%) 24229.15 (-630.29%) CoeffVar fault-base-5 107.79 ( 0.00%) 273.81 (-154.03%) 113.98 ( -5.75%) 328.92 (-205.16%) CoeffVar fault-huge-5 134.47 ( 0.00%) 380.19 (-182.73%) 150.93 ( -12.24%) 328.02 (-143.94%) CoeffVar fault-both-5 124.62 ( 0.00%) 357.08 (-186.53%) 137.61 ( -10.42%) 328.87 (-163.89%) Max fault-base-5 88873.00 ( 0.00%) 386539.00 (-334.93%) 115638.00 ( -30.12%) 486930.00 (-447.89%) Max fault-huge-5 63735.00 ( 0.00%) 602544.00 (-845.39%) 139082.00 (-118.22%) 426777.00 (-569.61%) Max fault-both-5 88873.00 ( 0.00%) 602544.00 (-577.98%) 139082.00 ( -56.50%) 486930.00 (-447.89%) BAmean-50 fault-base-5 1192.71 ( 0.00%) 3101.71 (-160.06%) 2170.91 ( -82.02%) 2692.97 (-125.79%) BAmean-50 fault-huge-5 756.99 ( 0.00%) 972.96 ( -28.53%) 1112.99 ( -47.03%) 1168.70 ( -54.39%) BAmean-50 fault-both-5 953.65 ( 0.00%) 1455.39 ( -52.61%) 1319.56 ( -38.37%) 1375.04 ( -44.19%) BAmean-95 fault-base-5 2109.87 ( 0.00%) 4941.87 (-134.23%) 4056.10 ( -92.24%) 4672.90 (-121.48%) 
BAmean-95 fault-huge-5 2158.64 ( 0.00%) 3867.03 ( -79.14%) 2766.41 ( -28.16%) 3395.88 ( -57.32%) BAmean-95 fault-both-5 2127.00 ( 0.00%) 4183.64 ( -96.69%) 3169.07 ( -48.99%) 3666.57 ( -72.38%) BAmean-99 fault-base-5 2349.85 ( 0.00%) 6811.21 (-189.86%) 4512.17 ( -92.02%) 5738.18 (-144.19%) BAmean-99 fault-huge-5 2538.90 ( 0.00%) 7109.96 (-180.04%) 3225.96 ( -27.06%) 5194.97 (-104.61%) BAmean-99 fault-both-5 2443.68 ( 0.00%) 6952.77 (-184.52%) 3626.66 ( -48.41%) 5300.01 (-116.89%) thpfioscale Percentage Faults Huge 5.2.0 5.3.0 5.4.0-rc6 5.4.0-rc6 vanilla vanilla vanilla thptweak-v1r1 Percentage huge-5 54.74 ( 0.00%) 68.14 ( 24.49%) 68.64 ( 25.40%) 78.97 ( 44.27%) In this case, 5.4.0-rc6 has lower latency when allocating pages whether THP is used or a fallback to base pages. The patch has latency somewhere between 5.2 and 5.4-rc6 indicating that more effort is being made. However, as expected the allocation success rate of the patch is higher than all the others. This indicates the higher latency is due to the additional work done to allocate the page. Finally I ran a benchmark that faults memory in such a way as to expect high compaction activity usemem 5.2.0 5.3.0 5.4.0-rc6 5.4.0-rc6 vanilla vanilla vanilla thptweak-v1r1 Amean elsp-1 42.09 ( 0.00%) 26.59 * 36.84%* 36.75 ( 12.70%) 40.23 ( 4.42%) Amean elsp-3 32.96 ( 0.00%) 7.48 * 77.29%* 8.73 * 73.50%* 11.49 * 65.15%* Amean elsp-4 5.69 ( 0.00%) 5.85 * -2.81%* 5.68 ( 0.18%) 5.68 ( 0.18%) 5.2.0 5.3.0 5.4.0-rc6 5.4.0-rc6 vanilla vanilla vanilla thptweak-v1r1 Ops Swap Ins 2937652.00 0.00 695423.00 224792.00 Ops Swap Outs 3647757.00 0.00 797035.00 262654.00 This is showing that 5.3.0 does not swap at all and is the fastest, 5.4.0-rc6 introduced swapping and the patch reduced it somewhat. So, the patch behaves more or less as expected and I'm not seeing clear evidence that the patch makes swapping more likely as feared by David. However, the kvm testcase is very limited and needs more information on exactly how Andrea was testing it to induce swap storms. I can easily make guesses but chances are that I'll guess wrong.
On Wed, 13 Nov 2019, Mel Gorman wrote: > > > The whole point of the Vlastimil's patch is to have an optimistic local > > > node allocation first and the full gfp context one in the fallback path. > > > If our full gfp context doesn't really work well then we can revisit > > > that of course but that should happen at alloc_hugepage_direct_gfpmask > > > level. > > > > Since the patch reverts the precaution put into the page allocator to not > > attempt reclaim if the allocation order is significantly large and the > > return value from compaction specifies it is unlikely to succed on its > > own, I believe Vlastimil's patch will cause the same regression that > > Andrea saw is the whole host is low on memory and/or significantly > > fragmented. So the suggestion was that he test this change to make sure > > we aren't introducing a regression for his workload. > > TLDR: I do not have evidence that Vlastimil's patch causes more swapping > but more information is needed from Andrea on exactly how he's > testing this. It's not clear to me what was originally tested > and whether memory just had to be full or whether it had to be > fragmented. If fragmented, then we have to agree on what an > appropriate mechanism is for fragmenting memory. Hypothetical > kernel modules that don't exist do not count. > > I put together a testcase whereby a virtual machine is deployed, started > and then time how long it takes to run memhog on 80% of the guests > physical memory. I varied how large the virtual machine is and ran it on > a 2-socket machine so that the smaller tests would be single node and > the larger tests would span both nodes. Before each startup, a large > file is read to fill the memory with pagecache. > First, thanks very much for the follow-up and considerable amount of time testing and benchmarking this. I, like you, do not have a reliable test case that will reproduce the issue that Andrea initially reported over a year ago. I believe, in the discussion that repeatedly referred to swap storms, that with the __GFP_THISNODE policy we were not thrashing because the local node was low on memory due to page cache. How memory is filled with page cache will naturally affect how it can be reclaimed when compaction fails locally; I don't know if it's an accurate representation of the initial problem. I also don't recall details about the swapfile or exactly where we were contending while trying to fault local hugepages. My concern, and it's only a concern at this point and not a regression report because we don't have a result from Andrea, is that the result of this patch is that the second allocation in alloc_pages_vma() enables the exact same allocation policy that Andrea reported was a problem earlier if __GFP_DIRECT_RECLAIM is set, which it will be as a result of qemu's use of MADV_HUGEPAGE. That reclaim and compaction is now done over the entire system and not isolated only to the local node so there are two plausible outcomes: (1) the remote node is not fragmented and we can easily fault a remote hugepage or (2) we thrash and cause swap storms remotely as well as locally. (1) is the outcome that Andrea is seeking based on the initial reverts: that much we know. So my concern is that if the *system* is fragmented we have now introduced a much more significant swap storm that will result in a much more serious regression.
So my question would be: if we know the previous behavior that allowed excessive swap and recalling into compaction was deemed harmful for the local node, why do we now believe it cannot be harmful if done for all system memory? The patch subverts the precaution put into place in the page allocator to specifically not do this excessive reclaim and recall into compaction dance and I believe restores the previous bad behavior if remote memory is similarly fragmented. (What prevents this??) Andrea was able to test this on several kernel versions with a fragmented local node so I *assume* it would not be difficult to measure the extent to which this patch can become harmful if all memory is fragmented. I'm hoping that we can quantify that potentially negative impact before opening users up to the possibility. As you said, it behaves better on some systems and workloads and worse on others and we both agree more information is needed. I think asking Andrea to test and quantify the change with a fragmented system would help us to make a more informed decision and not add a potential regression to 5.5 or whichever kernel this would be merged in.
On Sun 24-11-19 16:10:53, David Rientjes wrote: [...] > So my question would be: if we know the previous behavior that allowed > excessive swap and recalling into compaction was deemed harmful for the > local node, why do we now believe it cannot be harmful if done for all > system memory? I have to say that I got lost in your explanation. I have already pointed this out in a previous email you didn't reply to. But the main difference to previous __GFP_THISNODE behavior is that it is used along with __GFP_NORETRY and that reduces the overall effort of the reclaim AFAIU. If that is not the case then please be _explicit_ why. Having test results from Andrea would be really appreciated of course but he seems to be too busy to do that (or maybe not interested anymore). I do not see any real reason to hold back this patch based on hand waving though. So either we have some good reasoning to argue against the patch or good testing results, or we should go ahead. As things stand right now, THP success rate went down after your last changes for _very simple_ workloads. This needs addressing, which I hope we do agree on.
On Mon, 25 Nov 2019, Michal Hocko wrote: > > So my question would be: if we know the previous behavior that allowed > > excessive swap and recalling into compaction was deemed harmful for the > > local node, why do we now believe it cannot be harmful if done for all > > system memory? > > I have to say that I got lost in your explanation. I have already > pointed this out in a previous email you didn't reply to. But the main > difference to previous __GFP_THISNODE behavior is that it is used along > with __GFP_NORETRY and that reduces the overall effort of the reclaim > AFAIU. If that is not the case then please be _explicit_ why. > I'm referring to the second allocation in alloc_pages_vma() after the patch: /* * If hugepage allocations are configured to always * synchronous compact or the vma has been madvised * to prefer hugepage backing, retry allowing remote - * memory as well. + * memory with both reclaim and compact as well. */ if (!page && (gfp & __GFP_DIRECT_RECLAIM)) page = __alloc_pages_node(hpage_node, - gfp | __GFP_NORETRY, order); + gfp, order); So we now do not have __GFP_NORETRY nor __GFP_THISNODE so this bypasses all the precautionary logic in the page allocator that avoids excessive swap: it is free to continue looping, swapping, and thrashing, trying to allocate hugepages if all memory is fragmented. Qemu uses MADV_HUGEPAGE so this allocation *will* be attempted for Andrea's workload. The swap storms were reported for the same allocation but with __GFP_THISNODE so it only occurred for local fragmentation and low-on-memory conditions for the local node in the past. This is now opened up for all nodes. So the question is: what prevents the exact same issue from happening again for Andrea's usecase if all memory on the system is fragmented? I'm assuming that if this were tested under such conditions that the swap storms would be much worse.
On 11/25/19 9:38 PM, David Rientjes wrote: > On Mon, 25 Nov 2019, Michal Hocko wrote: > >>> So my question would be: if we know the previous behavior that allowed >>> excessive swap and recalling into compaction was deemed harmful for the >>> local node, why do we now believe it cannot be harmful if done for all >>> system memory? >> >> I have to say that I got lost in your explanation. I have already >> pointed this out in a previous email you didn't reply to. But the main >> difference to previous __GFP_THISNODE behavior is that it is used along >> with __GFP_NORETRY and that reduces the overall effort of the reclaim >> AFAIU. If that is not the case then please be _explicit_ why. >> > > I'm referring to the second allocation in alloc_pages_vma() after the > patch: > > /* > * If hugepage allocations are configured to always > * synchronous compact or the vma has been madvised > * to prefer hugepage backing, retry allowing remote > - * memory as well. > + * memory with both reclaim and compact as well. > */ > if (!page && (gfp & __GFP_DIRECT_RECLAIM)) > page = __alloc_pages_node(hpage_node, > - gfp | __GFP_NORETRY, order); > + gfp, order); > > So we now do not have __GFP_NORETRY nor __GFP_THISNODE so this bypasses > all the precautionary logic in the page allocator that avoids excessive > swap: it is free to continue looping, swapping, and thrashing, trying to > allocate hugepages if all memory is fragmented. Well, it's not completely free to do all that, there's other reclaim/compaction logic that terminates the attempts. If it's insufficient for some workloads, I would like to know, with hard data. I think the past observations of thrashing with __GFP_THISNODE do not prove the reclaim/compaction logic is wrong, see below. > Qemu uses MADV_HUGEPAGE so this allocation *will* be attempted for > Andrea's workload. The swap storms were reported for the same allocation > but with __GFP_THISNODE so it only occurred for local fragmentation and > low-on-memory conditions for the local node in the past. This is now > opened up for all nodes. I think it's also possible to explain the thrashing with __GFP_THISNODE without the premise of local fragmentation and suboptimal reclaim/compact decisions. THP __GFP_THISNODE allocations might simply put too much pressure on the local node with a workload that might be sized for the whole system. It's the same problem as with the node reclaim mode. Even if there's no bad fragmentation and THP allocations succeed without much trouble, they fill the node and cause reclaim (both kswapd and direct). Fallback order-0 allocations are not restricted, so they will use all nodes. The new THP retry without __GFP_THISNODE is also not restricted, so it's very much possible it will not cause this thrashing with a properly sized workload for the whole system. > So the question is: what prevents the exact same issue from happening > again for Andrea's usecase if all memory on the system is fragmented? I'm > assuming that if this were tested under such conditions that the swap > storms would be much worse. We will only know if Andrea tests it. But if his workloads were fine with just the initial __GFP_THISNODE reverts, they should be still fine after this patch?
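The "other reclaim/compaction logic that terminates the attempts" referred to above is the retry bookkeeping in __alloc_pages_slowpath(). A heavily condensed, from-memory sketch of how it bounds a costly-order attempt even without __GFP_NORETRY is below; function and constant names follow mm/page_alloc.c of that era, but the flow is simplified pseudocode rather than the actual code.

	/*
	 * Heavily condensed sketch of the retry bounds in
	 * __alloc_pages_slowpath() for a costly order
	 * (order > PAGE_ALLOC_COSTLY_ORDER) allocation without __GFP_NORETRY.
	 * Illustrative pseudocode only.
	 */
	retry:
		/* ... one round of direct reclaim, then one of direct compaction ... */

		/*
		 * should_reclaim_retry() stops retrying once the remaining
		 * reclaimable pages cannot lift any eligible zone above its
		 * watermark, and caps no-progress loops at MAX_RECLAIM_RETRIES.
		 */
		if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
					 did_some_progress > 0, &no_progress_loops))
			goto retry;

		/*
		 * should_compact_retry() grants costly orders only a fraction of
		 * MAX_COMPACT_RETRIES, and deferred/skipped compaction cuts the
		 * retries short.
		 */
		if (did_some_progress > 0 &&
		    should_compact_retry(ac, order, alloc_flags, compact_result,
					 &compact_priority, &compaction_retries))
			goto retry;

		/*
		 * No retry granted: a costly order without __GFP_RETRY_MAYFAIL
		 * does not invoke the OOM killer either, so the THP fault falls
		 * back to base pages.
		 */
		goto nopage;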