Message ID | 20180925120326.24392-2-mhocko@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | thp nodereclaim fixes | expand |
On Tue, Sep 25, 2018 at 02:03:25PM +0200, Michal Hocko wrote: > From: Andrea Arcangeli <aarcange@redhat.com> > > THP allocation might be really disruptive when allocated on NUMA system > with the local node full or hard to reclaim. Stefan has posted an > allocation stall report on 4.12 based SLES kernel which suggests the > same issue: > > [245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null) > [245513.363983] kvm cpuset=/ mems_allowed=0-1 > [245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased) > [245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017 > [245513.365905] Call Trace: > [245513.366535] dump_stack+0x5c/0x84 > [245513.367148] warn_alloc+0xe0/0x180 > [245513.367769] __alloc_pages_slowpath+0x820/0xc90 > [245513.368406] ? __slab_free+0xa9/0x2f0 > [245513.369048] ? __slab_free+0xa9/0x2f0 > [245513.369671] __alloc_pages_nodemask+0x1cc/0x210 > [245513.370300] alloc_pages_vma+0x1e5/0x280 > [245513.370921] do_huge_pmd_wp_page+0x83f/0xf00 > [245513.371554] ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0 > [245513.372184] ? do_huge_pmd_anonymous_page+0x631/0x6d0 > [245513.372812] __handle_mm_fault+0x93d/0x1060 > [245513.373439] handle_mm_fault+0xc6/0x1b0 > [245513.374042] __do_page_fault+0x230/0x430 > [245513.374679] ? get_vtime_delta+0x13/0xb0 > [245513.375411] do_page_fault+0x2a/0x70 > [245513.376145] ? page_fault+0x65/0x80 > [245513.376882] page_fault+0x7b/0x80 > [...] > [245513.382056] Mem-Info: > [245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5 > active_file:60183 inactive_file:245285 isolated_file:0 > unevictable:15657 dirty:286 writeback:1 unstable:0 > slab_reclaimable:75543 slab_unreclaimable:2509111 > mapped:81814 shmem:31764 pagetables:370616 bounce:0 > free:32294031 free_pcp:6233 free_cma:0 > [245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no > [245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no > > The defrag mode is "madvise" and from the above report it is clear that > the THP has been allocated for MADV_HUGEPAGA vma. > > Andrea has identified that the main source of the problem is > __GFP_THISNODE usage: > > : The problem is that direct compaction combined with the NUMA > : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very > : hard the local node, instead of failing the allocation if there's no > : THP available in the local node. > : > : Such logic was ok until __GFP_THISNODE was added to the THP allocation > : path even with MPOL_DEFAULT. > : > : The idea behind the __GFP_THISNODE addition, is that it is better to > : provide local memory in PAGE_SIZE units than to use remote NUMA THP > : backed memory. That largely depends on the remote latency though, on > : threadrippers for example the overhead is relatively low in my > : experience. > : > : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in > : extremely slow qemu startup with vfio, if the VM is larger than the > : size of one host NUMA node. This is because it will try very hard to > : unsuccessfully swapout get_user_pages pinned pages as result of the > : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE > : allocations and instead of trying to allocate THP on other nodes (it > : would be even worse without vfio type1 GUP pins of course, except it'd > : be swapping heavily instead). > > Fix this by removing __GFP_THISNODE for THP requests which are > requesting the direct reclaim. This effectivelly reverts 5265047ac301 on > the grounds that the zone/node reclaim was known to be disruptive due > to premature reclaim when there was memory free. While it made sense at > the time for HPC workloads without NUMA awareness on rare machines, it > was ultimately harmful in the majority of cases. The existing behaviour > is similiar, if not as widespare as it applies to a corner case but > crucially, it cannot be tuned around like zone_reclaim_mode can. The > default behaviour should always be to cause the least harm for the > common case. > > If there are specialised use cases out there that want zone_reclaim_mode > in specific cases, then it can be built on top. Longterm we should > consider a memory policy which allows for the node reclaim like behavior > for the specific memory ranges which would allow a > > [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com > > [mhocko@suse.com: rewrote the changelog based on the one from Andrea] > Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node") > Cc: Zi Yan <zi.yan@cs.rutgers.edu> > Cc: stable # 4.1+ > Reported-by: Stefan Priebe <s.priebe@profihost.ag> > Debugged-by: Andrea Arcangeli <aarcange@redhat.com> > Reported-by: Alex Williamson <alex.williamson@redhat.com> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> > Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mel Gorman <mgorman@techsingularity.net> Both patches look correct to me but I'm responding to this one because it's the fix. The change makes sense and moves further away from the severe stalling behaviour we used to see with both THP and zone reclaim mode. I put together a basic experiment with usemem configured to reference a buffer multiple times that is 80% the size of main memory on a 2-socket box with symmetric node sizes and defrag set to "always". The defrag setting is not the default but it would be functionally similar to accessing a buffer with madvise(MADV_HUGEPAGE). Usemem is configured to reference the buffer multiple times and while it's not an interesting workload, it would be expected to complete reasonably quickly as it fits within memory. The results were; usemem vanilla noreclaim-v1 Amean Elapsd-1 42.78 ( 0.00%) 26.87 ( 37.18%) Amean Elapsd-3 27.55 ( 0.00%) 7.44 ( 73.00%) Amean Elapsd-4 5.72 ( 0.00%) 5.69 ( 0.45%) This shows the elapsed time in seconds for 1 thread, 3 threads and 4 threads referencing buffers 80% the size of memory. With the patches applied, it's 37.18% faster for the single thread and 73% faster with two threads. Note that 4 threads showing little difference does not indicate the problem is related to thread counts. It's simply the case that 4 threads gets spread so their workload mostly fits in one node. The overall view from /proc/vmstats is more startling 4.19.0-rc1 4.19.0-rc1 vanillanoreclaim-v1r1 Minor Faults 35593425 708164 Major Faults 484088 36 Swap Ins 3772837 0 Swap Outs 3932295 0 Massive amounts of swap in/out without the patch Direct pages scanned 6013214 0 Kswapd pages scanned 0 0 Kswapd pages reclaimed 0 0 Direct pages reclaimed 4033009 0 Lots of reclaim activity without the patch Kswapd efficiency 100% 100% Kswapd velocity 0.000 0.000 Direct efficiency 67% 100% Direct velocity 11191.956 0.000 Mostly from direct reclaim context as you'd expect without the patch. Page writes by reclaim 3932314.000 0.000 Page writes file 19 0 Page writes anon 3932295 0 Page reclaim immediate 42336 0 Writes from reclaim context is never good but the patch eliminates it. We should never have default behaviour to thrash the system for such a basic workload. If zone reclaim mode behaviour is ever desired but on a single task instead of a global basis then the sensible option is to build a mempolicy that enforces that behaviour.
On Tue 25-09-18 13:20:08, Mel Gorman wrote: > On Tue, Sep 25, 2018 at 02:03:25PM +0200, Michal Hocko wrote: > > From: Andrea Arcangeli <aarcange@redhat.com> > > > > THP allocation might be really disruptive when allocated on NUMA system > > with the local node full or hard to reclaim. Stefan has posted an > > allocation stall report on 4.12 based SLES kernel which suggests the > > same issue: > > > > [245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null) > > [245513.363983] kvm cpuset=/ mems_allowed=0-1 > > [245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased) > > [245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017 > > [245513.365905] Call Trace: > > [245513.366535] dump_stack+0x5c/0x84 > > [245513.367148] warn_alloc+0xe0/0x180 > > [245513.367769] __alloc_pages_slowpath+0x820/0xc90 > > [245513.368406] ? __slab_free+0xa9/0x2f0 > > [245513.369048] ? __slab_free+0xa9/0x2f0 > > [245513.369671] __alloc_pages_nodemask+0x1cc/0x210 > > [245513.370300] alloc_pages_vma+0x1e5/0x280 > > [245513.370921] do_huge_pmd_wp_page+0x83f/0xf00 > > [245513.371554] ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0 > > [245513.372184] ? do_huge_pmd_anonymous_page+0x631/0x6d0 > > [245513.372812] __handle_mm_fault+0x93d/0x1060 > > [245513.373439] handle_mm_fault+0xc6/0x1b0 > > [245513.374042] __do_page_fault+0x230/0x430 > > [245513.374679] ? get_vtime_delta+0x13/0xb0 > > [245513.375411] do_page_fault+0x2a/0x70 > > [245513.376145] ? page_fault+0x65/0x80 > > [245513.376882] page_fault+0x7b/0x80 > > [...] > > [245513.382056] Mem-Info: > > [245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5 > > active_file:60183 inactive_file:245285 isolated_file:0 > > unevictable:15657 dirty:286 writeback:1 unstable:0 > > slab_reclaimable:75543 slab_unreclaimable:2509111 > > mapped:81814 shmem:31764 pagetables:370616 bounce:0 > > free:32294031 free_pcp:6233 free_cma:0 > > [245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no > > [245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no > > > > The defrag mode is "madvise" and from the above report it is clear that > > the THP has been allocated for MADV_HUGEPAGA vma. > > > > Andrea has identified that the main source of the problem is > > __GFP_THISNODE usage: > > > > : The problem is that direct compaction combined with the NUMA > > : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very > > : hard the local node, instead of failing the allocation if there's no > > : THP available in the local node. > > : > > : Such logic was ok until __GFP_THISNODE was added to the THP allocation > > : path even with MPOL_DEFAULT. > > : > > : The idea behind the __GFP_THISNODE addition, is that it is better to > > : provide local memory in PAGE_SIZE units than to use remote NUMA THP > > : backed memory. That largely depends on the remote latency though, on > > : threadrippers for example the overhead is relatively low in my > > : experience. > > : > > : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in > > : extremely slow qemu startup with vfio, if the VM is larger than the > > : size of one host NUMA node. This is because it will try very hard to > > : unsuccessfully swapout get_user_pages pinned pages as result of the > > : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE > > : allocations and instead of trying to allocate THP on other nodes (it > > : would be even worse without vfio type1 GUP pins of course, except it'd > > : be swapping heavily instead). > > > > Fix this by removing __GFP_THISNODE for THP requests which are > > requesting the direct reclaim. This effectivelly reverts 5265047ac301 on > > the grounds that the zone/node reclaim was known to be disruptive due > > to premature reclaim when there was memory free. While it made sense at > > the time for HPC workloads without NUMA awareness on rare machines, it > > was ultimately harmful in the majority of cases. The existing behaviour > > is similiar, if not as widespare as it applies to a corner case but > > crucially, it cannot be tuned around like zone_reclaim_mode can. The > > default behaviour should always be to cause the least harm for the > > common case. > > > > If there are specialised use cases out there that want zone_reclaim_mode > > in specific cases, then it can be built on top. Longterm we should > > consider a memory policy which allows for the node reclaim like behavior > > for the specific memory ranges which would allow a > > > > [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com > > > > [mhocko@suse.com: rewrote the changelog based on the one from Andrea] > > Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node") > > Cc: Zi Yan <zi.yan@cs.rutgers.edu> > > Cc: stable # 4.1+ > > Reported-by: Stefan Priebe <s.priebe@profihost.ag> > > Debugged-by: Andrea Arcangeli <aarcange@redhat.com> > > Reported-by: Alex Williamson <alex.williamson@redhat.com> > > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > Reviewed-by: Mel Gorman <mgorman@techsingularity.net> > > Both patches look correct to me but I'm responding to this one because > it's the fix. The change makes sense and moves further away from the > severe stalling behaviour we used to see with both THP and zone reclaim > mode. > > I put together a basic experiment with usemem configured to reference a > buffer multiple times that is 80% the size of main memory on a 2-socket box > with symmetric node sizes and defrag set to "always". The defrag setting > is not the default but it would be functionally similar to accessing a > buffer with madvise(MADV_HUGEPAGE). Usemem is configured to reference > the buffer multiple times and while it's not an interesting workload, > it would be expected to complete reasonably quickly as it fits within > memory. The results were; > > usemem > vanilla noreclaim-v1 > Amean Elapsd-1 42.78 ( 0.00%) 26.87 ( 37.18%) > Amean Elapsd-3 27.55 ( 0.00%) 7.44 ( 73.00%) > Amean Elapsd-4 5.72 ( 0.00%) 5.69 ( 0.45%) > > This shows the elapsed time in seconds for 1 thread, 3 threads and 4 threads > referencing buffers 80% the size of memory. With the patches applied, it's > 37.18% faster for the single thread and 73% faster with two threads. Note > that 4 threads showing little difference does not indicate the problem is > related to thread counts. It's simply the case that 4 threads gets spread > so their workload mostly fits in one node. > > The overall view from /proc/vmstats is more startling > > 4.19.0-rc1 4.19.0-rc1 > vanillanoreclaim-v1r1 > Minor Faults 35593425 708164 > Major Faults 484088 36 > Swap Ins 3772837 0 > Swap Outs 3932295 0 > > Massive amounts of swap in/out without the patch > > Direct pages scanned 6013214 0 > Kswapd pages scanned 0 0 > Kswapd pages reclaimed 0 0 > Direct pages reclaimed 4033009 0 > > Lots of reclaim activity without the patch > > Kswapd efficiency 100% 100% > Kswapd velocity 0.000 0.000 > Direct efficiency 67% 100% > Direct velocity 11191.956 0.000 > > Mostly from direct reclaim context as you'd expect without the patch. > > Page writes by reclaim 3932314.000 0.000 > Page writes file 19 0 > Page writes anon 3932295 0 > Page reclaim immediate 42336 0 > > Writes from reclaim context is never good but the patch eliminates it. > > We should never have default behaviour to thrash the system for such a > basic workload. If zone reclaim mode behaviour is ever desired but on a > single task instead of a global basis then the sensible option is to build > a mempolicy that enforces that behaviour. Thanks a lot for numbers Mel!
On Tue, 25 Sep 2018, Michal Hocko wrote: > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index da858f794eb6..149b6f4cf023 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -2046,8 +2046,36 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, > nmask = policy_nodemask(gfp, pol); > if (!nmask || node_isset(hpage_node, *nmask)) { > mpol_cond_put(pol); > - page = __alloc_pages_node(hpage_node, > - gfp | __GFP_THISNODE, order); > + /* > + * We cannot invoke reclaim if __GFP_THISNODE > + * is set. Invoking reclaim with > + * __GFP_THISNODE set, would cause THP > + * allocations to trigger heavy swapping > + * despite there may be tons of free memory > + * (including potentially plenty of THP > + * already available in the buddy) on all the > + * other NUMA nodes. > + * > + * At most we could invoke compaction when > + * __GFP_THISNODE is set (but we would need to > + * refrain from invoking reclaim even if > + * compaction returned COMPACT_SKIPPED because > + * there wasn't not enough memory to succeed > + * compaction). For now just avoid > + * __GFP_THISNODE instead of limiting the > + * allocation path to a strict and single > + * compaction invocation. > + * > + * Supposedly if direct reclaim was enabled by > + * the caller, the app prefers THP regardless > + * of the node it comes from so this would be > + * more desiderable behavior than only > + * providing THP originated from the local > + * node in such case. > + */ > + if (!(gfp & __GFP_DIRECT_RECLAIM)) > + gfp |= __GFP_THISNODE; > + page = __alloc_pages_node(hpage_node, gfp, order); > goto out; > } > } This causes, on average, a 13.9% access latency regression on Haswell, and the regression would likely be more severe on Naples and Rome. There exist libraries that allow the .text segment of processes to be remapped to memory backed by transparent hugepages and use MADV_HUGEPAGE to stress local compaction to defragment node local memory for hugepages at startup. The cost, including the statistics Mel gathered, is acceptable for these processes: they are not concerned with startup cost, they are concerned only with optimal access latency while they are running. So while it may take longer to start the process because memory compaction is attempting to allocate hugepages with __GFP_DIRECT_RECLAIM, in the cases where compaction is successful, this is a very significant long-term win. In cases where compaction fails, falling back to local pages of the native page size instead of remote thp is a win for the remaining time this process wins: as stated, 13.9% faster for all memory accesses to the process's text while it runs on Haswell. So these processes, and there are 100's of users of these libraries, explicitly want to incur the cost of startup to attempt to get local thp and fallback only when necessary to local pages of the native page size. They want the behavior this patch is changing and regress if the patch is merged. This patch does not provide an alternative beyond MPOL_BIND to preserve the existing behavior. In that case, if local memory is actually oom, we unnecessarily oom kill processes which would be completely unacceptable to these users since they are fine to accept 10-20% of their memory being faulted remotely if necessary to prevent processes being oom killed. These libraries were implemented with the existing behavior of MADV_HUGEPAGE in mind and I cannot ask for 100s of binaries to be rebuilt to enforce new constraints specifically for hugepages with no alternative that does not allow unnecessary oom killing. There are ways to address this without introducing regressions for existing users of MADV_HUGEPAGE: introduce an madvise() mode to accept remote thp allocations, which users of this library would never set, or fix memory compaction so that it does not incur substantial allocation latency when it will likely fail. We don't introduce 13.9% regressions for binaries that are correctly using MADV_HUGEPAGE as it is implemented. Nacked-by: David Rientjes <rientjes@google.com>
Hello David, On Thu, Oct 04, 2018 at 01:16:32PM -0700, David Rientjes wrote: > There are ways to address this without introducing regressions for > existing users of MADV_HUGEPAGE: introduce an madvise() mode to accept > remote thp allocations, which users of this library would never set, or > fix memory compaction so that it does not incur substantial allocation > latency when it will likely fail. These librarians needs to call a new MADV_ and the current MADV_HUGEPAGE should not be affected because the new MADV_ will require some capbility (i.e. root privilege). qemu was the first user of MADV_HUGEPAGE and I don't think it's fair to break it and require change to it to run at higher privilege to retain the direct compaction behavior of MADV_HUGEPAGE. The new behavior you ask to retain in MADV_HUGEPAGE, generated the same misbehavior to VM as mlock could have done too, so it can't just be given by default without any privilege whatsoever. Ok you could mitigate the breakage that MADV_HUGEPAGE could have generated (before the recent fix) by isolating malicious or inefficient programs with memcg, but by default in a multiuser system without cgroups the global disruption provided before the fix (i.e. the pathological THP behavior) is not warranted. memcg shouldn't be mandatory to avoid a process to affect the VM in such a strong way (i.e. all other processes who happened to be allocated in the node where the THP allocation triggered, being trashed in swap like if all memory of all other nodes was not completely free). Not only that, it's not only about malicious processes it's also excessively inefficient for processes that just don't fit in a local node and use MADV_HUGEPAGE. Your processes all fit in the local node for sure if they're happy about it. This was reported as a "pathological THP regression" after all in a workload that couldn't swap at all because of the iommu gup persistent refcount pins. The alternative patch I posted which still invoked direct reclaim locally, and falled back to NUMA-local PAGE_SIZEd allocations instead of falling back to NUMA-remote THP, could have been given without privilege, but it still won't swapout as this patch would have done, so it still won't go as far as the MADV_HUGEPAGE behavior had before the fix (for the lib users that strongly needed local memory). Overall I think the call about the default behavior of MADV_HUGEPAGE is still between removing __GFP_THISNODE if gfp_flags can reclaim (the fix in -mm), or by changing direct compaction to only call compaction and not reclaim (i.e. __GFP_COMPACT_ONLY) when __GFP_THISNODE is set. To go beyond that some privilege is needed and a new MADV_ flag can require privilege or return error if there's not enough privilege. So the lib with 100's users can try to use that new flag first, show an error in stderr (maybe under debug), and fallback to MADV_HUGEPAGE if the app hasn't enough privilege. The alternative is to add a new mem policy less strict than MPOL_BIND to achieve what you need on top of MADV_HUGEPAGE (which also would require some privilege of course as all mbinds). I assume you already evaluated the preferred and local mbinds and it's not a perfect fit? If we keep this as a new MADV_HUGEPAGE_FORCE_LOCAL flag, you could still add a THP sysfs/sysctl control to lift the privilege requirement marking it as insecure setting in docs (mm/transparent_hugepage/madv_hugepage_force_local=0|1 forced to 0 by default). This would be on the same lines of other sysctl that increase the max number of files open and such things (perhaps a sysctl would be better in fact for tuning in /etc/sysctl.conf). Note there was still some improvement left possible in my __GFP_COMPACT_ONLY patch alternative. Notably if the watermarks for the local node shown the local node not to have enough real "free" PAGE_SIZEd pages to succeed the PAGE_SIZEd local THP allocation if compaction failed, we should have relaxed __GFP_THISNODE and tried to allocate THP from the NUMA-remote nodes before falling back to PAGE_SIZEd allocations. That also won't require any new privilege. To get the same behavior as before though you would need to drop all caches with echo 3 >drop_caches before the app starts (and it still won't swap anon memory which previous MADV_HUGEPAGE behavior would have, but the whole point is that the previous MADV_HUGEPAGE behavior would have backfired for the lib too if the app was allocating so much RAM from the local node that it required swapouts of local anon memory). Thanks, Andrea
On Thu, 4 Oct 2018, Andrea Arcangeli wrote: > Hello David, > Hi Andrea, > On Thu, Oct 04, 2018 at 01:16:32PM -0700, David Rientjes wrote: > > There are ways to address this without introducing regressions for > > existing users of MADV_HUGEPAGE: introduce an madvise() mode to accept > > remote thp allocations, which users of this library would never set, or > > fix memory compaction so that it does not incur substantial allocation > > latency when it will likely fail. > > These librarians needs to call a new MADV_ and the current > MADV_HUGEPAGE should not be affected because the new MADV_ will > require some capbility (i.e. root privilege). > > qemu was the first user of MADV_HUGEPAGE and I don't think it's fair > to break it and require change to it to run at higher privilege to > retain the direct compaction behavior of MADV_HUGEPAGE. > > The new behavior you ask to retain in MADV_HUGEPAGE, generated the > same misbehavior to VM as mlock could have done too, so it can't just > be given by default without any privilege whatsoever. > > Ok you could mitigate the breakage that MADV_HUGEPAGE could have > generated (before the recent fix) by isolating malicious or > inefficient programs with memcg, but by default in a multiuser system > without cgroups the global disruption provided before the fix > (i.e. the pathological THP behavior) is not warranted. memcg shouldn't > be mandatory to avoid a process to affect the VM in such a strong way > (i.e. all other processes who happened to be allocated in the node > where the THP allocation triggered, being trashed in swap like if all > memory of all other nodes was not completely free). > The source of the problem needs to be addressed: memory compaction. We regress because we lose __GFP_NORETRY and pointlessly try reclaim, but deferred compaction is supposedly going to prevent repeated (and unnecessary) calls to memory compaction that ends up thrashing your local node. This is likely because your workload has a size greater than 2MB * the deferred compaction threshold, normally set at 64. This ends up repeatedly calling memory compaction and ending up being expensive when it should fail once and not be called again in the near term. But that's a memory compaction issue, not a thp gfp mask issue; the reclaim issue is responded to below. > Not only that, it's not only about malicious processes it's also > excessively inefficient for processes that just don't fit in a local > node and use MADV_HUGEPAGE. Your processes all fit in the local node > for sure if they're happy about it. This was reported as a > "pathological THP regression" after all in a workload that couldn't > swap at all because of the iommu gup persistent refcount pins. > This patch causes an even worse regression if all system memory is fragmented such that thp cannot be allocated because it tries to stress compaction on remote nodes, which ends up unsuccessfully, not just the local node. On Haswell, when all memory is fragmented (not just the local node as I obtained by 13.9% regression result), the patch results in a fault latency regression of 40.9% for MADV_HUGEPAGE region of 8GB. This is because it is thrashing both nodes pointlessly instead of just failing for __GFP_THISNODE. So the end result is that the patch regresses access latency forever by 13.9% when the local node is fragmented because it is accessing remote thp vs local pages of the native page size, and regresses fault latency of 40.9% when the system is fully fragmented. The only time that fault latency is improved is when remote memory is not fully fragmented, but then you must incur the remote access latency. > Overall I think the call about the default behavior of MADV_HUGEPAGE > is still between removing __GFP_THISNODE if gfp_flags can reclaim (the > fix in -mm), or by changing direct compaction to only call compaction > and not reclaim (i.e. __GFP_COMPACT_ONLY) when __GFP_THISNODE is set. > There's two issues: the expensiveness of the page allocator involving compaction for MADV_HUGEPAGE mappings and the desire for userspace to fault thp remotely and incur the 13.9% performance regression forever. If reclaim is avoided like it should be with __GFP_NORETRY for even MADV_HUGEPAGE regions, you should only experience latency introduced by node local memory compaction. The __GFP_NORETRY was removed by commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations"). The current implementation of the page allocator does not match the expected behavior of the thp gfp flags. Memory compaction has deferred compaction to avoid costly scanning when it has recently failed, and that likely needs to be addressed directly rather than relying on a count of how many times it has failed; if you fault more than 128MB at the same time, does it make sense to immediately compact again? Likely not. > To go beyond that some privilege is needed and a new MADV_ flag can > require privilege or return error if there's not enough privilege. So > the lib with 100's users can try to use that new flag first, show an > error in stderr (maybe under debug), and fallback to MADV_HUGEPAGE if > the app hasn't enough privilege. The alternative is to add a new mem > policy less strict than MPOL_BIND to achieve what you need on top of > MADV_HUGEPAGE (which also would require some privilege of course as > all mbinds). I assume you already evaluated the preferred and local > mbinds and it's not a perfect fit? > > If we keep this as a new MADV_HUGEPAGE_FORCE_LOCAL flag, you could > still add a THP sysfs/sysctl control to lift the privilege requirement > marking it as insecure setting in docs > (mm/transparent_hugepage/madv_hugepage_force_local=0|1 forced to 0 by > default). This would be on the same lines of other sysctl that > increase the max number of files open and such things (perhaps a > sysctl would be better in fact for tuning in /etc/sysctl.conf). > > Note there was still some improvement left possible in my > __GFP_COMPACT_ONLY patch alternative. Notably if the watermarks for > the local node shown the local node not to have enough real "free" > PAGE_SIZEd pages to succeed the PAGE_SIZEd local THP allocation if > compaction failed, we should have relaxed __GFP_THISNODE and tried to > allocate THP from the NUMA-remote nodes before falling back to > PAGE_SIZEd allocations. That also won't require any new privilege. > Direct reclaim doesn't make much sense for thp allocations if compaction has failed, even for MADV_HUGEPAGE. I've discounted Mel's results because he is using thp defrag set to "always", which includes __GFP_NORETRY but the default option and anything else other than "always" does not use __GFP_NORETRY like the page allocator believes it does: /* * Checks for costly allocations with __GFP_NORETRY, which * includes THP page fault allocations */ if (costly_order && (gfp_mask & __GFP_NORETRY)) { /* * If compaction is deferred for high-order allocations, * it is because sync compaction recently failed. If * this is the case and the caller requested a THP * allocation, we do not want to heavily disrupt the * system, so we fail the allocation instead of entering * direct reclaim. */ if (compact_result == COMPACT_DEFERRED) goto nopage; So he is avoiding the cost of reclaim, which you are not, specifically because he is using defrag == "always". __GFP_NORETRY should be included for any thp allocation and it's a regression that it doesn't.
On Thu, Oct 04, 2018 at 01:16:32PM -0700, David Rientjes wrote: > On Tue, 25 Sep 2018, Michal Hocko wrote: > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > index da858f794eb6..149b6f4cf023 100644 > > --- a/mm/mempolicy.c > > +++ b/mm/mempolicy.c > > @@ -2046,8 +2046,36 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, > > nmask = policy_nodemask(gfp, pol); > > if (!nmask || node_isset(hpage_node, *nmask)) { > > mpol_cond_put(pol); > > - page = __alloc_pages_node(hpage_node, > > - gfp | __GFP_THISNODE, order); > > + /* > > + * We cannot invoke reclaim if __GFP_THISNODE > > + * is set. Invoking reclaim with > > + * __GFP_THISNODE set, would cause THP > > + * allocations to trigger heavy swapping > > + * despite there may be tons of free memory > > + * (including potentially plenty of THP > > + * already available in the buddy) on all the > > + * other NUMA nodes. > > + * > > + * At most we could invoke compaction when > > + * __GFP_THISNODE is set (but we would need to > > + * refrain from invoking reclaim even if > > + * compaction returned COMPACT_SKIPPED because > > + * there wasn't not enough memory to succeed > > + * compaction). For now just avoid > > + * __GFP_THISNODE instead of limiting the > > + * allocation path to a strict and single > > + * compaction invocation. > > + * > > + * Supposedly if direct reclaim was enabled by > > + * the caller, the app prefers THP regardless > > + * of the node it comes from so this would be > > + * more desiderable behavior than only > > + * providing THP originated from the local > > + * node in such case. > > + */ > > + if (!(gfp & __GFP_DIRECT_RECLAIM)) > > + gfp |= __GFP_THISNODE; > > + page = __alloc_pages_node(hpage_node, gfp, order); > > goto out; > > } > > } > > This causes, on average, a 13.9% access latency regression on Haswell, and > the regression would likely be more severe on Naples and Rome. > That assumes that fragmentation prevents easy allocation which may very well be the case. While it would be great that compaction or the page allocator could be further improved to deal with fragmentation, it's outside the scope of this patch. > There exist libraries that allow the .text segment of processes to be > remapped to memory backed by transparent hugepages and use MADV_HUGEPAGE > to stress local compaction to defragment node local memory for hugepages > at startup. That is taking advantage of a co-incidence of the implementation. MADV_HUGEPAGE is *advice* that huge pages be used, not what the locality is. A hint for strong locality preferences should be separate advice (madvise) or a separate memory policy. Doing that is outside the context of this patch but nothing stops you introducing such a policy or madvise, whichever you think would be best for the libraries to consume (I'm only aware of libhugetlbfs but there might be others). > The cost, including the statistics Mel gathered, is > acceptable for these processes: they are not concerned with startup cost, > they are concerned only with optimal access latency while they are > running. > Then such applications at startup have the option of setting zone_reclaim_mode during initialisation assuming a privileged helper can be created. That would be somewhat heavy handed and a longer-term solution would still be to create a proper memory policy of madvise flag for those libraries. > So while it may take longer to start the process because memory compaction > is attempting to allocate hugepages with __GFP_DIRECT_RECLAIM, in the > cases where compaction is successful, this is a very significant long-term > win. In cases where compaction fails, falling back to local pages of the > native page size instead of remote thp is a win for the remaining time > this process wins: as stated, 13.9% faster for all memory accesses to the > process's text while it runs on Haswell. > Again, I remind you that it only benefits applications that prefectly fit into NUMA nodes. Not all applications are created with that level of awareness and easily get thrashed if using MADV_HUGEPAGE and do not fit into a NUMA node. While it is unfortunate that there are specialised applications that benefit from the current configuration, I bet there is heavier usage of qemu affected by the bug this patch addresses than specialised applications that both fit perfectly into NUMA nodes and are extremely sensitive to access latencies. It's a question of causing the least harm to the most users which is what this patch does. If you need behaviour for more agressive reclaim or locality hints then kindly introduce them and do not depend in MADV_HUGEPAGE accidentically doubling up as hints about memory locality.
On Fri, 5 Oct 2018, Mel Gorman wrote: > > This causes, on average, a 13.9% access latency regression on Haswell, and > > the regression would likely be more severe on Naples and Rome. > > > > That assumes that fragmentation prevents easy allocation which may very > well be the case. While it would be great that compaction or the page > allocator could be further improved to deal with fragmentation, it's > outside the scope of this patch. > Hi Mel, The regression that Andrea is working on, correct me if I'm wrong, is heavy reclaim and swapping activity that is trying to desperately allocate local hugepages when the local node is fragmented based on advice provided by MADV_HUGEPAGE. Why is it ever appropriate to do heavy reclaim and swap activity to allocate a transparent hugepage? This is exactly what the __GFP_NORETRY check for high-order allocations is attempting to avoid, and it explicitly states that it is for thp faults. The fact that we lost __GFP_NORERY for thp allocations for all settings, including the default setting, other than yours (setting of "always") is what I'm focusing on. There is no guarantee that this activity will free an entire pageblock or that it is even worthwhile. Why is thp memory ever being allocated without __GFP_NORETRY as the page allocator expects? That aside, removing __GFP_THISNODE can make the fault latency much worse if remote notes are fragmented and/or reclaim has the inability to free contiguous memory, which it likely cannot. This is where I measured over 40% fault latency regression from Linus's tree with this patch on a fragmnented system where order-9 memory is neither available from node 0 or node 1 on Haswell. > > There exist libraries that allow the .text segment of processes to be > > remapped to memory backed by transparent hugepages and use MADV_HUGEPAGE > > to stress local compaction to defragment node local memory for hugepages > > at startup. > > That is taking advantage of a co-incidence of the implementation. > MADV_HUGEPAGE is *advice* that huge pages be used, not what the locality > is. A hint for strong locality preferences should be separate advice > (madvise) or a separate memory policy. Doing that is outside the context > of this patch but nothing stops you introducing such a policy or madvise, > whichever you think would be best for the libraries to consume (I'm only > aware of libhugetlbfs but there might be others). > The behavior that MADV_HUGEPAGE specifies is certainly not clearly defined, unfortunately. The way that an application writer may read it, as we have, is that it will make a stronger attempt at allocating a hugepage at fault. This actually works quite well when the allocation correctly has __GFP_NORETRY, as it's supposed to, and compaction is MIGRATE_ASYNC. So rather than focusing on what MADV_HUGEPAGE has meant over the past 2+ years of kernels that we have implemented based on, or what it meant prior to that, is a fundamental question of the purpose of direct reclaim and swap activity that had always been precluded before __GFP_NORETRY was removed in a thp allocation. I don't think anybody in this thread wants 14% remote access latency regression if we allocate remotely or 40% fault latency regression when remote nodes are fragmented as well. Removing __GFP_THISNODE only helps when remote memory is not fragmented, otherwise it multiplies the problem as I've shown. The numbers that you provide while using the non-default option to mimick MADV_HUGEPAGE mappings but also use __GFP_NORETRY makes the actual source of the problem quite easy to identify: there is an inconsistency in the thp gfp mask and the page allocator implementation. > > The cost, including the statistics Mel gathered, is > > acceptable for these processes: they are not concerned with startup cost, > > they are concerned only with optimal access latency while they are > > running. > > > > Then such applications at startup have the option of setting > zone_reclaim_mode during initialisation assuming a privileged helper > can be created. That would be somewhat heavy handed and a longer-term > solution would still be to create a proper memory policy of madvise flag > for those libraries. > We *never* want to use zone_reclaim_mode for these allocations, that would be even worse, we do not want to reclaim because we have a very unlikely chance of making pageblocks free without the involvement of compaction. We want to trigger memory compaction with a well-bounded cost that MIGRATE_ASYNC provides and then fail.
Hi, On Fri, Oct 05, 2018 at 01:35:15PM -0700, David Rientjes wrote: > Why is it ever appropriate to do heavy reclaim and swap activity to > allocate a transparent hugepage? This is exactly what the __GFP_NORETRY > check for high-order allocations is attempting to avoid, and it explicitly > states that it is for thp faults. The fact that we lost __GFP_NORERY for > thp allocations for all settings, including the default setting, other > than yours (setting of "always") is what I'm focusing on. There is no > guarantee that this activity will free an entire pageblock or that it is > even worthwhile. I tried to add just __GFP_NORETRY but it changes nothing. Try it yourself if you think that can resolve the swap storm and excessive reclaim CPU overhead... and see if it works. I didn't intend to reinvent the wheel with __GFP_COMPACT_ONLY, if __GFP_NORETRY would have worked. I tried adding __GFP_NORETRY first of course. Reason why it doesn't help is: compaction fails because not enough free RAM, reclaim is invoked, compaction succeeds, THP is allocated to your lib user, compaction fails because not enough free RAM, reclaim is invoked etc.. compact_result is not COMPACT_DEFERRED, but COMPACT_SKIPPED. See the part "reclaim is invoked" (with __GFP_THISNODE), is enough to still create the same heavy swap storm and unfairly penalize all apps with memory allocated in the local node like if your library had actually the kernel privilege to run mbind or mlock, which is not ok. Only __GFP_COMPACT_ONLY truly can avoid reclaim, the moment reclaim can run with __GFP_THISNODE set, all bets are off and we're back to square one, no difference (at best marginal difference) with __GFP_NORETRY being set. > That aside, removing __GFP_THISNODE can make the fault latency much worse > if remote notes are fragmented and/or reclaim has the inability to free > contiguous memory, which it likely cannot. This is where I measured over > 40% fault latency regression from Linus's tree with this patch on a > fragmnented system where order-9 memory is neither available from node 0 > or node 1 on Haswell. Discussing the drawbacks of removing __GFP_THISNODE is an orthogonal topic. __GFP_COMPACT_ONLY approach didn't have any of those drawbacks about the remote latency because __GFP_THISNODE was still set at all times, just as you like it. You seem to think __GFP_NORETRY will work as well as __GFP_COMPACT_ONLY but it doesn't. Calling compaction (and only compaction!) with __GFP_THISNODE set doesn't break anything and that was what __GFP_COMPACT_ONLY was about. > The behavior that MADV_HUGEPAGE specifies is certainly not clearly > defined, unfortunately. The way that an application writer may read it, > as we have, is that it will make a stronger attempt at allocating a > hugepage at fault. This actually works quite well when the allocation > correctly has __GFP_NORETRY, as it's supposed to, and compaction is > MIGRATE_ASYNC. Like Mel said, your app just happens to fit in a local node, if the user of the lib is slightly different and allocates 16G on a system where each node is 4G, the post-fix MADV_HUGEPAGE will perform extremely better also for the lib user. And you know, if the lib user fits in one node, it can use mbind and it won't hit OOM... and you'd need some capability giving the app privilege anyway to keep MADV_HUGEPAGE as deep and unfair to the rest of the processes running the local node (like mbind and mlock require too). Could you just run a test with the special lib and allocate 4 times the size of a node, and see how the lib performs with upstream and upstream+fix? Feel free to add __GFP_NORETRY anywhere you like in the test of the upstream without fix. The only constraint I would ask for the test (if the app using the lib is not a massively multithreaded app, like qemu is, and you just intend to run malloc(SIZEOFNODE*4); memset) is to run the app-lib under "taskset -c 0". Otherwise NUMA balancing could move the the CPU next to the last memory touched, which couldn't be done if each thread accesses all ram at random from all 4 nodes at the same time (which is a totally legitimate workload too and must not hit the "pathological THP allocation performance"). > removed in a thp allocation. I don't think anybody in this thread wants > 14% remote access latency regression if we allocate remotely or 40% fault > latency regression when remote nodes are fragmented as well. Did you try the __GFP_COMPACT_ONLY patch? That won't have the 40% fault latency already. Also you're underestimating the benefit of THP given from remote nodes for virt a bit, the 40% fault latency is not an issue when the allocation is long lived, which is what MADV_HUGEPAGE is telling the kernel, and the benefit of THP for guest is multiplied. It's more a feature than a bug that 40% fault latency with MADV_HUGEPAGE set at least for all long lived allocations (but if the allocations aren't long lived, why should MADV_HUGEPAGE have been set in the first place?). With the fix applied, if you want to add __GFP_THISNODE to compaction only by default you still can, that solves the 40% fault latency just like __GFP_COMPACT_ONLY patch would also have avoided that 40% increased fault latency. The issue here is very simple: you can't use __GFP_THISNODE for an allocation that can invoke reclaim, unless you have mlock or mbind higher privilege capabilities. Compaction can be totally called with __GFP_THISNODE. Just I believe KVM prefers the fault latency increased as it only happens during VM warmup phase and it prefers more THP immediately. Compaction on the remote nodes (i.e. the higher latency) is not wasted CPU for KVM, as it would be for a short lived allocations (MADV_HUGEPAGE incidentally is about to tell the kernel which allocations are long lived and to tell the kernel when it's worth running compaction in direct reclaim). > The numbers that you provide while using the non-default option to mimick > MADV_HUGEPAGE mappings but also use __GFP_NORETRY makes the actual source > of the problem quite easy to identify: there is an inconsistency in the > thp gfp mask and the page allocator implementation. I'm still unconvinced about the __GFP_NORETRY arguments because I tried it already. However if you send a patch that fixes everything by only adding __GFP_NORETRY in every place you wish, I'd be glad to test it and verify if it actually solves the problem. Also note: +#define __GFP_ONLY_COMPACT ((__force gfp_t)(___GFP_NORETRY | \ + ___GFP_ONLY_COMPACT)) If ___GFP_NORETRY would have been enough __GFP_ONLY_COMPACT would have defined to ___GFP_NORETRY alone. When I figured it wasn't enough I added ___GFP_ONLY_COMPACT to retain the __GFP_THISNODE in the only place it's safe to retain it (i.e. compaction). Thanks, Andrea
Hello, On Thu, Oct 04, 2018 at 04:05:26PM -0700, David Rientjes wrote: > The source of the problem needs to be addressed: memory compaction. We > regress because we lose __GFP_NORETRY and pointlessly try reclaim, but I commented in detail about the __GFP_NORETRY topic in the other email so I will skip the discussion about __GFP_NORETRY in the context of this answer except for the comment at the end of the email to the actual code that implements __GFP_NORETRY. > But that's a memory compaction issue, not a thp gfp mask issue; the > reclaim issue is responded to below. Actually memory compaction has no issues whatsoever with __GFP_THISNODE regardless of __GFP_NORETRY. > This patch causes an even worse regression if all system memory is > fragmented such that thp cannot be allocated because it tries to stress > compaction on remote nodes, which ends up unsuccessfully, not just the > local node. > > On Haswell, when all memory is fragmented (not just the local node as I > obtained by 13.9% regression result), the patch results in a fault latency > regression of 40.9% for MADV_HUGEPAGE region of 8GB. This is because it > is thrashing both nodes pointlessly instead of just failing for > __GFP_THISNODE. There's no I/O involved at the very least on compaction, nor we drop any cache or shrink any slab by mistake by just invoking compaction. Even when you hit the worst case "all nodes are 100% fragmented" scenario that generates the 40% increased allocation latency, all other tasks running in the local node will keep running fine, and they won't be pushed away forcefully into swap with all their kernel cache depleted, which is a mlock/mbind privileged behavior that the app using the MADV_HUGEPAGE lib should not ever been able to inflict on other processes running in the node from different users (users as in uid). Furthermore when you incur the worst case latency after that there's compact deferred logic skipping compaction next time around if all nodes were so fragmented to the point of guaranteed failure. While there's nothing stopping reclaim to run every time COMPACT_SKIPPED is returned just because compaction keeps succeeding as reclaim keeps pushing more 2M amounts into swap from the local nodes. I don't doubt with 1024 nodes things can get pretty bad when they're all 100% fragmented, __GFP_THISNODE would win in such case, but then what you're asking then is the __GFP_COMPACT_ONLY behavior. That will solve it. What we'd need probably regardless of how we solve this bug (because not all compaction invocations are THP invocations... and we can't keep making special cases and optimizations tailored for THP or we end up in that same 40% higher latency for large skbs and other stuff) is a more sophisticated COMPACT_DEFERRED logic where you can track when remote compaction failed. Then you wait many more times before trying a global compaction. It could be achieved with just a compact_deferred counter in the zone/pgdat (wherever it fits best). Overall I don't think the bug we're dealing with and the slowdown of compaction on the remote nodes are comparable, also considering the latter will still happen regardless if you've large skbs or other drivers allocating large amounts of memory as an optimization. > So the end result is that the patch regresses access latency forever by > 13.9% when the local node is fragmented because it is accessing remote thp > vs local pages of the native page size, and regresses fault latency of > 40.9% when the system is fully fragmented. The only time that fault > latency is improved is when remote memory is not fully fragmented, but > then you must incur the remote access latency. You get THP however which will reduce the TLB miss cost and maximize TLB usage, so it depends on the app if that 13.9% cost is actually offseted by the THP benefit or not. It entirely depends if large part of the workload mostly fits in in-socket CPU cache. The more the in-socket/node CPU cache pays off, the more remote-THP also pays off. There would be definitely workloads that would run faster, not slower, with the remote THP instead of local PAGE_SIZEd memory. The benefit of THP is also larger for the guest loads than for host loads, so it depends on that too. We agree about the latency issue with a ton of RAM and thousands of nodes, but again that can be mitigated with a NUMA friendly COMPACT_DEFERRED logic NUMA aware. Even without such NUMA-aware-compact_deferred logic improvement, the worst case of the remote compaction behavior still doesn't look nearly as bad as this bug by thinking about it. And it only is a concern for extremely large NUMA systems (which may run the risk of running in other solubility issues in other places if random workloads are applied to it and all nodes are low on memory and fully fragmented which is far from common scenario on those large systems), while the bug we fixed was hurting badly all very common 2 nodes installs with workloads that are common and should run fine. > Direct reclaim doesn't make much sense for thp allocations if compaction > has failed, even for MADV_HUGEPAGE. I've discounted Mel's results because > he is using thp defrag set to "always", which includes __GFP_NORETRY but > the default option and anything else other than "always" does not use > __GFP_NORETRY like the page allocator believes it does: > > /* > * Checks for costly allocations with __GFP_NORETRY, which > * includes THP page fault allocations > */ > if (costly_order && (gfp_mask & __GFP_NORETRY)) { > /* > * If compaction is deferred for high-order allocations, > * it is because sync compaction recently failed. If > * this is the case and the caller requested a THP > * allocation, we do not want to heavily disrupt the > * system, so we fail the allocation instead of entering > * direct reclaim. > */ > if (compact_result == COMPACT_DEFERRED) > goto nopage; > > So he is avoiding the cost of reclaim, which you are not, specifically > because he is using defrag == "always". __GFP_NORETRY should be included > for any thp allocation and it's a regression that it doesn't. Compaction doesn't fail, it returns COMPACT_SKIPPED and it asks to do a run of reclam to generate those 2M of PAGE_SIZEd memory required to move a 2M piece into the newly freely PAGE_SIZEd fragments. So __GFP_NORETRY never jumps to "nopage" and it never gets deferred either. For the record, I didn't trace it literally with gdb to validate my theory of why forcefully adding __GFP_NORETRY didn't move the needle, so feel free to do more investigations in that area if you see any pitfall in the theory. Thanks, Andrea
On Fri, 5 Oct 2018, Andrea Arcangeli wrote: > I tried to add just __GFP_NORETRY but it changes nothing. Try it > yourself if you think that can resolve the swap storm and excessive > reclaim CPU overhead... and see if it works. I didn't intend to > reinvent the wheel with __GFP_COMPACT_ONLY, if __GFP_NORETRY would > have worked. I tried adding __GFP_NORETRY first of course. > > Reason why it doesn't help is: compaction fails because not enough > free RAM, reclaim is invoked, compaction succeeds, THP is allocated to > your lib user, compaction fails because not enough free RAM, reclaim > is invoked etc.. compact_result is not COMPACT_DEFERRED, but > COMPACT_SKIPPED. > > See the part "reclaim is invoked" (with __GFP_THISNODE), is enough to > still create the same heavy swap storm and unfairly penalize all apps > with memory allocated in the local node like if your library had > actually the kernel privilege to run mbind or mlock, which is not ok. > > Only __GFP_COMPACT_ONLY truly can avoid reclaim, the moment reclaim > can run with __GFP_THISNODE set, all bets are off and we're back to > square one, no difference (at best marginal difference) with > __GFP_NORETRY being set. > The page allocator is expecting __GFP_NORETRY for thp allocations per its comment: /* * Checks for costly allocations with __GFP_NORETRY, which * includes THP page fault allocations */ if (costly_order && (gfp_mask & __GFP_NORETRY)) { And that enables us to check compact_result to determine whether thp allocation should fail or continue to reclaim. I don't think it helps that some thp fault allocations use __GFP_NORETRY and others do not. I think that deserves a fix to alloc_hugepage_direct_gfpmask() or GFP_TRANSHUGE_LIGHT. Our library that uses MADV_HUGEPAGE only invokes direct compaction because we're on an older kernel: it does not attempt to do reclaim to make compaction happy so that it finds memory that it can migrate memory to. For reference, we use defrag setting of "defer+madvise". Everybody who does not use MADV_HUGEPAGE kicks off kswapd/kcompactd and fails, MADV_HUGEPAGE users do the same but also try direct compaction. That direct compaction uses a mode of MIGRATE_ASYNC so it normally fails because of need_resched() or spinlock contention. These allocations always fail, MADV_HUGEPAGE or otherwise, without invoking direct reclaim. I am agreeing with both you and Mel that it makes no sense to thrash the local node to make compaction happy and then hugepage-order memory available. I'm only differing with you on the mechanism to fail early: we never want to do attempt reclaim on thp allocations specifically because it leads to the behavior you are addressing. My contention is that removing __GFP_THISNODE papers over the problem, especially in cases where remote memory is also fragmnented. It incurs a much higher (40%) fault latency and then incurs 13.9% greater access latency. It is not a good result, at least for Haswell, Naples, and Rome. To make a case that we should fault hugepages remotely as fallback, either for non-MADV_HUGEPAGE users who do not use direct compaction, or MADV_HUGEPAGE users who use direct compaction, we need numbers that suggest there is a performance benefit in terms of access latency to suggest that it is better; this is especially the case when the fault latency is 40% higher. On Haswell, Naples, and Rome, it is quite obvious that this patch works much harder to fault memory remotely that incurs a substantial performance penalty when it fails and in the cases where it succeeds we have much worse access latency. For users who bind their applications to a subset of cores on a NUMA node, fallback is egregious: we are guaranteed to never have local access latency and in the case when the local node is fragmented and remote memory is not our fault latency goes through the roof when local native pages would have been much better. I've brought numbers on how poor this patch performs, so I'm asking for a rebuttal that suggests it is better on some platforms. (1) On what platforms is it better to fault remote hugepages over local native pages? (2) What is the fault latency increase when remote nodes are fragmented as well? > Like Mel said, your app just happens to fit in a local node, if the > user of the lib is slightly different and allocates 16G on a system > where each node is 4G, the post-fix MADV_HUGEPAGE will perform > extremely better also for the lib user. > It won't if the remote memory is fragmented; the patch is premised on the argument that remote memory is never fragmented or under memory pressure otherwise it is multiplying the fault latency by the number of nodes. Sure, creating test cases where the local node is under heavy memory pressure yet remote nodes are mostly free will minimize the impact this has on fault latency. It still is a substantial increase, as I've measured on Haswell, but the access latency forever is the penalty. This patch cannot make access to remote memory faster. > And you know, if the lib user fits in one node, it can use mbind and > it won't hit OOM... and you'd need some capability giving the app > privilege anyway to keep MADV_HUGEPAGE as deep and unfair to the rest > of the processes running the local node (like mbind and mlock require > too). > No, the lib user does not fit into a single node, we have cases where these nodes are under constant memory pressure. We are on an older kernel, however, that fails because of need_resched() and spinlock contention rather than falling back to local reclaim. > Did you try the __GFP_COMPACT_ONLY patch? That won't have the 40% > fault latency already. > I remember Kirill saying that he preferred node local memory allocation over thp allocation in the review, which I agree with, but it was much better than this patch. I see no reason why we should work so hard to reclaim memory, thrash local memory, and swap so that the compaction freeing scanner can find memory when there is *no* guarantee that the migration scanner can free an entire pageblock. That is completely pointless work, even for an MADV_HUGEPAGE user, and we certainly don't have such pathological behavior on older kernels. > Also you're underestimating the benefit of THP given from remote nodes > for virt a bit, the 40% fault latency is not an issue when the > allocation is long lived, which is what MADV_HUGEPAGE is telling the > kernel, and the benefit of THP for guest is multiplied. It's more a > feature than a bug that 40% fault latency with MADV_HUGEPAGE set at > least for all long lived allocations (but if the allocations aren't > long lived, why should MADV_HUGEPAGE have been set in the first place?). > For the long-term benefit of thp to outweigh the 40% fault latency increase, it would require that the access latency is better on remote memory. It's worse, it's 13.9% worse access latency on Haswell. It's a negative result for both allocation and access latency. > With the fix applied, if you want to add __GFP_THISNODE to compaction > only by default you still can, that solves the 40% fault latency just > like __GFP_COMPACT_ONLY patch would also have avoided that 40% > increased fault latency. > > The issue here is very simple: you can't use __GFP_THISNODE for an > allocation that can invoke reclaim, unless you have mlock or mbind > higher privilege capabilities. > Completely agreed, and I'm very strongly suggesting that we do not invoke reclaim. Memory compaction provides no guarantee that it can free an entire pageblock, even for MIGRATE_ASYNC compaction under MADV_HUGEPAGE it will attempt to migrate SWAP_CLUSTER_MAX pages from a pageblock without considering if the entire pageblock could eventually become freed for order-9 memory. The vast majority of the time this reclaim could be completely pointless. We cannot use it. > I'm still unconvinced about the __GFP_NORETRY arguments because I > tried it already. However if you send a patch that fixes everything by > only adding __GFP_NORETRY in every place you wish, I'd be glad to test > it and verify if it actually solves the problem. Also note: > > +#define __GFP_ONLY_COMPACT ((__force gfp_t)(___GFP_NORETRY | \ > + ___GFP_ONLY_COMPACT)) > > If ___GFP_NORETRY would have been enough __GFP_ONLY_COMPACT would have > defined to ___GFP_NORETRY alone. When I figured it wasn't enough I > added ___GFP_ONLY_COMPACT to retain the __GFP_THISNODE in the only > place it's safe to retain it (i.e. compaction). > We don't need __GFP_NORETRY anymore with this, it wouldn't be useful to continually compact memory in isolation if it already has failed. It seems strange to be using __GFP_DIRECT_RECLAIM | __GFP_ONLY_COMPACT, though, as I assume this would evolve into. Are there any other proposed users for __GFP_ONLY_COMPACT beyond thp allocations? If not, we should just save the gfp bit and encode the logic directly into the page allocator. Would you support this? --- diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c --- a/drivers/gpu/drm/ttm/ttm_page_alloc.c +++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c @@ -860,7 +860,7 @@ static int ttm_get_pages(struct page **pages, unsigned npages, int flags, while (npages >= HPAGE_PMD_NR) { gfp_t huge_flags = gfp_flags; - huge_flags |= GFP_TRANSHUGE_LIGHT | __GFP_NORETRY | + huge_flags |= GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM; huge_flags &= ~__GFP_MOVABLE; huge_flags &= ~__GFP_COMP; @@ -978,13 +978,13 @@ int ttm_page_alloc_init(struct ttm_mem_global *glob, unsigned max_pages) GFP_USER | GFP_DMA32, "uc dma", 0); ttm_page_pool_init_locked(&_manager->wc_pool_huge, - (GFP_TRANSHUGE_LIGHT | __GFP_NORETRY | + (GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM) & ~(__GFP_MOVABLE | __GFP_COMP), "wc huge", order); ttm_page_pool_init_locked(&_manager->uc_pool_huge, - (GFP_TRANSHUGE_LIGHT | __GFP_NORETRY | + (GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM) & ~(__GFP_MOVABLE | __GFP_COMP) , "uc huge", order); diff --git a/include/linux/gfp.h b/include/linux/gfp.h --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -298,7 +298,8 @@ struct vm_area_struct; #define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM) #define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE) #define GFP_TRANSHUGE_LIGHT ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \ - __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM) + __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \ + ~__GFP_RECLAIM) #define GFP_TRANSHUGE (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM) /* Convert GFP flags to their corresponding migrate type */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -628,13 +628,15 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf, * madvise: directly stall for MADV_HUGEPAGE, otherwise fail if not immediately * available * never: never stall for any thp allocation + * + * "Stalling" here implies direct memory compaction but not direct reclaim. */ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma) { const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE); if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags)) - return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY); + return GFP_TRANSHUGE; if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags)) return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM; if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags)) diff --git a/mm/page_alloc.c b/mm/page_alloc.c --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4145,6 +4145,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, if (compact_result == COMPACT_DEFERRED) goto nopage; + /* + * If faulting a hugepage, it is very unlikely that + * thrashing the zonelist is going to assist compaction + * in freeing an entire pageblock. There are no + * guarantees memory compaction can free an entire + * pageblock under such memory pressure that it is + * better to simply fail and fallback to native pages. + */ + if (order == pageblock_order && + !(current->flags & PF_KTHREAD)) + goto nopage; + /* * Looks like reclaim/compaction is worth trying, but * sync compaction could be very expensive, so keep
On Mon, Oct 08, 2018 at 01:41:09PM -0700, David Rientjes wrote: > On Fri, 5 Oct 2018, Andrea Arcangeli wrote: > > > I tried to add just __GFP_NORETRY but it changes nothing. Try it > > yourself if you think that can resolve the swap storm and excessive > > reclaim CPU overhead... and see if it works. I didn't intend to > > reinvent the wheel with __GFP_COMPACT_ONLY, if __GFP_NORETRY would > > have worked. I tried adding __GFP_NORETRY first of course. > > > > Reason why it doesn't help is: compaction fails because not enough > > free RAM, reclaim is invoked, compaction succeeds, THP is allocated to > > your lib user, compaction fails because not enough free RAM, reclaim > > is invoked etc.. compact_result is not COMPACT_DEFERRED, but > > COMPACT_SKIPPED. > > > > See the part "reclaim is invoked" (with __GFP_THISNODE), is enough to > > still create the same heavy swap storm and unfairly penalize all apps > > with memory allocated in the local node like if your library had > > actually the kernel privilege to run mbind or mlock, which is not ok. > > > > Only __GFP_COMPACT_ONLY truly can avoid reclaim, the moment reclaim > > can run with __GFP_THISNODE set, all bets are off and we're back to > > square one, no difference (at best marginal difference) with > > __GFP_NORETRY being set. > > > > The page allocator is expecting __GFP_NORETRY for thp allocations per its > comment: > > /* > * Checks for costly allocations with __GFP_NORETRY, which > * includes THP page fault allocations > */ > if (costly_order && (gfp_mask & __GFP_NORETRY)) { > > And that enables us to check compact_result to determine whether thp > allocation should fail or continue to reclaim. I don't think it helps > that some thp fault allocations use __GFP_NORETRY and others do not. I > think that deserves a fix to alloc_hugepage_direct_gfpmask() or > GFP_TRANSHUGE_LIGHT. > I am concerned that we may be trying to deal with this in terms of right and wrong when the range of workloads and their preferred semantics force us into a grey area. The THP user without __GFP_NORETRY is "always" always means that an application requesting THP will stall on allocation failure and directly reclaim pages and compact memory in an effort to allocate a THP immediately. This may be desirable for virtual machines that benefit heavily from THP use and are willing to delay the VM start to utilise them. Removing __GFP_NORETRY in this instance is due to the tuning hinting that direct reclaim is prefectly acceptable. __GFP_NORETRY means that the kernel will not compact memory after a recent failure but if the caller is willing to reclaim memory, then it follows that retrying compaction is justified. Is this the correct thing to do in 100% of cases? Probably not, there will be corner cases but it also does not necessarily follow that __GFP_NORETRY should be univeral. What is missing here is an agreed upon set of reference test cases that can be used for the basis of evaluating patches like this. They should be somewhat representative of the target applications of virtual memory initialisation (forget about runtime at the moment as that is stacking problems), a simulator of the google workload and library and my test case of simply referencing an amount of memory larger than one node. That would cover the current discussion at least but more would be needed later. Otherwise we're going to endlessly whack-a-mole fixing one workload and hurting another. It might be overkill but otherwise this discussion risks going in circles. The previous reference cases were ones that focused on either THP allocation success rates or the benefit of THP itself, neither of which are particularly useful in the current context. > Our library that uses MADV_HUGEPAGE only invokes direct compaction because > we're on an older kernel: it does not attempt to do reclaim to make > compaction happy so that it finds memory that it can migrate memory to. > For reference, we use defrag setting of "defer+madvise". Everybody who > does not use MADV_HUGEPAGE kicks off kswapd/kcompactd and fails, > MADV_HUGEPAGE users do the same but also try direct compaction. That > direct compaction uses a mode of MIGRATE_ASYNC so it normally fails > because of need_resched() or spinlock contention. > > These allocations always fail, MADV_HUGEPAGE or otherwise, without > invoking direct reclaim. > > I am agreeing with both you and Mel that it makes no sense to thrash the > local node to make compaction happy and then hugepage-order memory > available. I'm only differing with you on the mechanism to fail early: we > never want to do attempt reclaim on thp allocations specifically because > it leads to the behavior you are addressing. > > My contention is that removing __GFP_THISNODE papers over the problem, > especially in cases where remote memory is also fragmnented. It incurs a > much higher (40%) fault latency and then incurs 13.9% greater access > latency. It is not a good result, at least for Haswell, Naples, and Rome. > Ok, this is a good point. MADV_HUGEPAGE is hinting that huge pages are desired and in recent kernels, that also meant they would be local pages i.e. locality is more important than huge pages even if huge pages are desirable. With this change, it indicates that huge pages are more important than locality because that is what userspace hinted. Which is better? locality is more important if your workload fits in memory and the initialisation phase is not performance-critical. It would also be more important if your workload is larger than a node but the critical working set fits within a node while the other usages are like streaming readers and writers where the data is referenced once and can be safely reclaimed later. huge pages is more important if your workload is virtualised and benefits heavily due to reduced TLB cost from EPT. Similarly it's better if the workload has high special locality, particularly if it is also fitting within a cache where the remote cost is masked. These are simple examples and even then we cannot detect which case applies in advance so it falls back to what does the hint mean? The name suggests that huge pages are desirable and the locality is a hint. Granted, if that means THP is going remote and incurring cost there, it might be worse overall for some workloads, particularly if the system is fragmented. This goes back to my point that the MADV_HUGEPAGE hint should not make promises about locality and that introducing MADV_LOCAL for specialised libraries may be more appropriate with the initial semantic being how it treats MADV_HUGEPAGE regions. > To make a case that we should fault hugepages remotely as fallback, either > for non-MADV_HUGEPAGE users who do not use direct compaction, or > MADV_HUGEPAGE users who use direct compaction, we need numbers that > suggest there is a performance benefit in terms of access latency to > suggest that it is better; this is especially the case when the fault > latency is 40% higher. On Haswell, Naples, and Rome, it is quite obvious > that this patch works much harder to fault memory remotely that incurs a > substantial performance penalty when it fails and in the cases where it > succeeds we have much worse access latency. > > For users who bind their applications to a subset of cores on a NUMA node, > fallback is egregious: we are guaranteed to never have local access > latency and in the case when the local node is fragmented and remote > memory is not our fault latency goes through the roof when local native > pages would have been much better. > > I've brought numbers on how poor this patch performs, so I'm asking for a > rebuttal that suggests it is better on some platforms. (1) On what > platforms is it better to fault remote hugepages over local native pages? > (2) What is the fault latency increase when remote nodes are fragmented as > well? > On the flip side, we've also heard how some other applications are adversely affected such as virtual machine start-ups. What you're asking for asking is an astonishing amount of work -- a multi platform study against a wide variety of workloads with the addition that some test should be able to start in a fragmented state that is reproducible. What is being asked of you is to consider introducing a new madvise hint and having MADV_HUGEPAGE being about huge pages and introducing a new hint that is hinting about locality without the strictness of memory binding. That is significantly more tractable and you also presumably have access to a reasonable reproduction cases even though I doubt you can release it. > > Like Mel said, your app just happens to fit in a local node, if the > > user of the lib is slightly different and allocates 16G on a system > > where each node is 4G, the post-fix MADV_HUGEPAGE will perform > > extremely better also for the lib user. > > > > It won't if the remote memory is fragmented; the patch is premised on the > argument that remote memory is never fragmented or under memory pressure > otherwise it is multiplying the fault latency by the number of nodes. > Sure, creating test cases where the local node is under heavy memory > pressure yet remote nodes are mostly free will minimize the impact this > has on fault latency. It still is a substantial increase, as I've > measured on Haswell, but the access latency forever is the penalty. This > patch cannot make access to remote memory faster. > Indeed not but remote costs are highly platform dependant so decisions made for haswell are not necessarily justified on EPYC or any later Intel processor either. Because these goalposts will keep changing, I think it's better to have more clearly defined madvise hints and rely less on implementation details that are subject to change. Otherwise we will get stuck in an endless debate about "is workload X more important than Y"? No doubt this patch will have problems but it treats MADV_HUGEPAGE as a more stricter hint that huge pages are desirable. > > <SNIP> > > > > I'm still unconvinced about the __GFP_NORETRY arguments because I > > tried it already. However if you send a patch that fixes everything by > > only adding __GFP_NORETRY in every place you wish, I'd be glad to test > > it and verify if it actually solves the problem. Also note: > > > > +#define __GFP_ONLY_COMPACT ((__force gfp_t)(___GFP_NORETRY | \ > > + ___GFP_ONLY_COMPACT)) > > > > If ___GFP_NORETRY would have been enough __GFP_ONLY_COMPACT would have > > defined to ___GFP_NORETRY alone. When I figured it wasn't enough I > > added ___GFP_ONLY_COMPACT to retain the __GFP_THISNODE in the only > > place it's safe to retain it (i.e. compaction). > > > > We don't need __GFP_NORETRY anymore with this, it wouldn't be useful to > continually compact memory in isolation if it already has failed. It > seems strange to be using __GFP_DIRECT_RECLAIM | __GFP_ONLY_COMPACT, > though, as I assume this would evolve into. > > Are there any other proposed users for __GFP_ONLY_COMPACT beyond thp > allocations? If not, we should just save the gfp bit and encode the logic > directly into the page allocator. > > Would you support this? I don't think it's necessarily bad but it cannot distinguish between THP and hugetlbfs. Hugetlbfs users are typically more willing to accept high overheads as they may be required for the application to function. That's probably fixable but will still leave us in the state where MADV_HUGEPAGE is also a hint about locality. It'd still be interesting to hear if it fixes the VM initialisation issue but do note that if this patch is used as a replacement that hugetlbfs users may complain down the line.
[Sorry for being slow in responding but I was mostly offline last few days] On Tue 09-10-18 10:48:25, Mel Gorman wrote: [...] > This goes back to my point that the MADV_HUGEPAGE hint should not make > promises about locality and that introducing MADV_LOCAL for specialised > libraries may be more appropriate with the initial semantic being how it > treats MADV_HUGEPAGE regions. I agree with your other points and not going to repeat them. I am not sure madvise s the best API for the purpose though. We are talking about memory policy here and there is an existing api for that so I would _prefer_ to reuse it for this purpose. Sure we will likely need somethin in the compaction as well but we should start simple and go forward in smaller steps.
On Tue, Oct 09, 2018 at 02:27:45PM +0200, Michal Hocko wrote: > [Sorry for being slow in responding but I was mostly offline last few > days] > > On Tue 09-10-18 10:48:25, Mel Gorman wrote: > [...] > > This goes back to my point that the MADV_HUGEPAGE hint should not make > > promises about locality and that introducing MADV_LOCAL for specialised > > libraries may be more appropriate with the initial semantic being how it > > treats MADV_HUGEPAGE regions. > > I agree with your other points and not going to repeat them. I am not > sure madvise s the best API for the purpose though. We are talking about > memory policy here and there is an existing api for that so I would > _prefer_ to reuse it for this purpose. > I flip-flopped on that one in my head multiple times on the basis of how strict it should be. Memory policies tend to be black or white -- bind here, interleave there, etc. It wasn't clear to me what the best policy would be to describe "allocate local as best as you can but allow fallbacks if necessary". Hence, I started leaning towards advise as it is really about advice that the kernel can ignore if necessary. That said, I don't feel as strongly about the "how" as I do about the fact that applications and libraries should not depend on side-effects of the MADV_HUGEPAGE implementation that relate to locality.
On 10/8/18 10:41 PM, David Rientjes wrote: > + /* > + * If faulting a hugepage, it is very unlikely that > + * thrashing the zonelist is going to assist compaction > + * in freeing an entire pageblock. There are no > + * guarantees memory compaction can free an entire > + * pageblock under such memory pressure that it is > + * better to simply fail and fallback to native pages. > + */ > + if (order == pageblock_order && > + !(current->flags & PF_KTHREAD)) > + goto nopage; After we got rid of similar hardcoded heuristics, I would be very unhappy to start adding them back. A new gfp flag is also unfortunate, but more acceptable to me. > + > /* > * Looks like reclaim/compaction is worth trying, but > * sync compaction could be very expensive, so keep >
On Tue 09-10-18 14:00:34, Mel Gorman wrote: > On Tue, Oct 09, 2018 at 02:27:45PM +0200, Michal Hocko wrote: > > [Sorry for being slow in responding but I was mostly offline last few > > days] > > > > On Tue 09-10-18 10:48:25, Mel Gorman wrote: > > [...] > > > This goes back to my point that the MADV_HUGEPAGE hint should not make > > > promises about locality and that introducing MADV_LOCAL for specialised > > > libraries may be more appropriate with the initial semantic being how it > > > treats MADV_HUGEPAGE regions. > > > > I agree with your other points and not going to repeat them. I am not > > sure madvise s the best API for the purpose though. We are talking about > > memory policy here and there is an existing api for that so I would > > _prefer_ to reuse it for this purpose. > > > > I flip-flopped on that one in my head multiple times on the basis of > how strict it should be. Memory policies tend to be black or white -- > bind here, interleave there, etc. It wasn't clear to me what the best > policy would be to describe "allocate local as best as you can but allow > fallbacks if necessary". I was thinking about MPOL_NODE_PROXIMITY with the following semantic: - try hard to allocate from a local or very close numa node(s) even when that requires expensive operations like the memory reclaim/compaction before falling back to other more distant numa nodes. > Hence, I started leaning towards advise as it is > really about advice that the kernel can ignore if necessary. That said, > I don't feel as strongly about the "how" as I do about the fact that > applications and libraries should not depend on side-effects of the > MADV_HUGEPAGE implementation that relate to locality. completely agreed on this.
On Tue, Oct 09, 2018 at 04:25:10PM +0200, Michal Hocko wrote: > On Tue 09-10-18 14:00:34, Mel Gorman wrote: > > On Tue, Oct 09, 2018 at 02:27:45PM +0200, Michal Hocko wrote: > > > [Sorry for being slow in responding but I was mostly offline last few > > > days] > > > > > > On Tue 09-10-18 10:48:25, Mel Gorman wrote: > > > [...] > > > > This goes back to my point that the MADV_HUGEPAGE hint should not make > > > > promises about locality and that introducing MADV_LOCAL for specialised > > > > libraries may be more appropriate with the initial semantic being how it > > > > treats MADV_HUGEPAGE regions. > > > > > > I agree with your other points and not going to repeat them. I am not > > > sure madvise s the best API for the purpose though. We are talking about > > > memory policy here and there is an existing api for that so I would > > > _prefer_ to reuse it for this purpose. > > > > > > > I flip-flopped on that one in my head multiple times on the basis of > > how strict it should be. Memory policies tend to be black or white -- > > bind here, interleave there, etc. It wasn't clear to me what the best > > policy would be to describe "allocate local as best as you can but allow > > fallbacks if necessary". > > I was thinking about MPOL_NODE_PROXIMITY with the following semantic: > - try hard to allocate from a local or very close numa node(s) even when > that requires expensive operations like the memory reclaim/compaction > before falling back to other more distant numa nodes. > Seems reasonable. It's not far from the general semantics I thought MADV_LOCAL would have.
On Tue, 9 Oct 2018, Mel Gorman wrote: > > The page allocator is expecting __GFP_NORETRY for thp allocations per its > > comment: > > > > /* > > * Checks for costly allocations with __GFP_NORETRY, which > > * includes THP page fault allocations > > */ > > if (costly_order && (gfp_mask & __GFP_NORETRY)) { > > > > And that enables us to check compact_result to determine whether thp > > allocation should fail or continue to reclaim. I don't think it helps > > that some thp fault allocations use __GFP_NORETRY and others do not. I > > think that deserves a fix to alloc_hugepage_direct_gfpmask() or > > GFP_TRANSHUGE_LIGHT. > > > > I am concerned that we may be trying to deal with this in terms of right > and wrong when the range of workloads and their preferred semantics > force us into a grey area. > > The THP user without __GFP_NORETRY is "always" > > always > means that an application requesting THP will stall on > allocation failure and directly reclaim pages and compact > memory in an effort to allocate a THP immediately. This may be > desirable for virtual machines that benefit heavily from THP > use and are willing to delay the VM start to utilise them. > I'm not sure what you mean when you say the thp user without __GFP_NORETRY is "always". If you're referring to the defrag mode "always", __GFP_NORETRY is only possible for non-madvised memory regions. All other defrag modes exclude it, and I argue they incorrectly exclude it. I believe the userspace expectation, whether it is described well or not, is that defrag mode "always" simply gets the same behavior for defrag mode "madvise" but for everybody, not just MADV_HUGEPAGE regions. Then, "defer+madvise" triggers kswapd/kcompactd but compacts for MADV_HUGEPAGE, "defer" triggers kswapd/compactd and fails, and "never" fails. This, to me, seems like the most sane heuristic. Whether reclaim is helpful or not is what I'm questioning, I don't think that it benefits any user to have strict adherence to a documentation that causes workloads to severely regress both in fault and access latency when we know that direct reclaim is unlikely to make direct compaction free an entire pageblock. It's more likely than not that the reclaim was pointless and the allocation will still fail. If memory compaction were patched such that it can report that it could successfully free a page of the specified order if there were free pages at the end of the zone it could migrate to, reclaim might be helpful. But with the current implementation, I don't think that is reliably possible. These free pages could easily be skipped over by the migration scanner because of the presence of slab pages, for example, and unavailable to the freeing scanner. > Removing __GFP_NORETRY in this instance is due to the tuning hinting that > direct reclaim is prefectly acceptable. __GFP_NORETRY means that the kernel > will not compact memory after a recent failure but if the caller is willing > to reclaim memory, then it follows that retrying compaction is justified. > > Is this the correct thing to do in 100% of cases? Probably not, there > will be corner cases but it also does not necessarily follow that > __GFP_NORETRY should be univeral. > For reclaim to be useful, I believe we need a major rewrite of memory compaction such that the freeing scanner is eliminated and it can benefit from memory passed over by the migration scanner. In this case, it would require the freeing scanner to do the actual reclaim itself. I don't believe it is beneficial to any user that we reclaim memory from the zone when we don't know (1) that the migration scanner will eventually be able to free an entire pageblock, (2) the reclaimed memory can be accessed by the freeing scanner as a migration target, and (3) all this synchronous work was not done on behalf of a user only to lose the hugepage to a concurrent allocator that is not even madvised. Since we lack all of this in the allocator/compaction/reclaim, it seems that the painful fault latency here can be corrected by doing what is actually useful, memory compaction, and rely on its heuristics of when to give up and when to proceed rather than thrash the zone. The insane fault latency being reported here is certainly not what the user is asking for when it does MADV_HUGEPAGE, and removing __GFP_THISNODE doesn't help in any way if remote memory is fragmented or low on memory as well. > What is missing here is an agreed upon set of reference test cases > that can be used for the basis of evaluating patches like this. They > should be somewhat representative of the target applications of virtual > memory initialisation (forget about runtime at the moment as that is > stacking problems), a simulator of the google workload and library and > my test case of simply referencing an amount of memory larger than one > node. That would cover the current discussion at least but more would be > needed later. Otherwise we're going to endlessly whack-a-mole fixing one > workload and hurting another. It might be overkill but otherwise this > discussion risks going in circles. > Considering only fault latency, it's relatively trivial to fragment all zones and then measure the time it takes to start your workload. This is where the 40% regression I reported came from. Having tons of memory free remotely yet having the local node under such strenuous memory pressure that compaction cannot even find free pages isn't a healthy balance. Perhaps we do not fault remote memory as easily because there is not an imbalance. My criticism is obviously based on the fault and access latency regressions that this introduces but is also focused on what makes sense for the current implementation of the page allocator, memory compaction, and direct reclaim. I know that direct reclaim is these contexts will very, very seldom help memory compaction because it requires strenuous memory pressure (otherwise compaction would have tried and failed to defrag memory for other reasons) and under those situations we have data that shows slab fragmentation is much more likely to occur such that compaction will never be successful. Situations where a node has 1.5GB of slab yet 100GB of pageblocks have 1+ slab pages because the allocator prefers to return node local memory at the cost of fragmenting MIGRATE_MOVABLE pageblocks. I really think that for this patch to be merged over my proposed change that it needs to be clearly demonstrated that reclaim was successful in that it freed memory that was subsequently available to the compaction freeing scanner and that enabled entire pageblocks to become free. That, in my experience, will very very seldom be successful because of internal slab fragmentation, compaction_alloc() cannot soak up pages from the reclaimed memory, and potentially thrash the zone completely pointlessly. The last point is the problem being reported here, but the other two are as legitimate. > Which is better? > > locality is more important if your workload fits in memory and the > initialisation phase is not performance-critical. It would > also be more important if your workload is larger than a > node but the critical working set fits within a node while > the other usages are like streaming readers and writers where > the data is referenced once and can be safely reclaimed later. > huge pages is more important if your workload is virtualised and > benefits heavily due to reduced TLB cost from EPT. > Similarly it's better if the workload has high special locality, > particularly if it is also fitting within a cache where the > remote cost is masked. > > These are simple examples and even then we cannot detect which case > applies in advance so it falls back to what does the hint mean? The name > suggests that huge pages are desirable and the locality is a hint. Granted, > if that means THP is going remote and incurring cost there, it might be > worse overall for some workloads, particularly if the system is fragmented. > > This goes back to my point that the MADV_HUGEPAGE hint should not make > promises about locality and that introducing MADV_LOCAL for specialised > libraries may be more appropriate with the initial semantic being how it > treats MADV_HUGEPAGE regions. > Forget locality, the patch incurs a 40% fault regression when remote memory is fragmented as well. Thrashing the local node will begin to thrash the remote nodes as well if they are balanced and fragmented in the same way. This cannot possibly be in the best interest of the user. What is in the best interest of the user is what can actually be gained by incurring that latency. We talked about a "privileged" madvise mode that is even stronger with regard to locality before, which I don't think is needed. If you have the privilege you could always use drop_caches, explicitly trigger MIGRATE_SYNC compaction yourself, speedup khugepaged in that background temporarily, etc. You could build upon my patch and say that compaction is beneficial, but we can still remove __GFP_THISNODE to get remote hugepages. But let us please focus on the role that reclaim has when allocating a hugepage coupled with the implementation of the page allocator. When a user reports this 40% fault latency and 13.9% access latency regressions if this patch is merged, what is the suggested fix? Wait for another madvise mode and then patch your binary, assuming you can modify the source, and then ask for a privilege so it can stress reclaim locally? > What you're asking for asking is an astonishing amount of work -- > a multi platform study against a wide variety of workloads with the > addition that some test should be able to start in a fragmented state > that is reproducible. > No, I'm asking that you show in the implementation of the page allocator, memory compaction, and direct reclaim why the work done by reclaim (the zone thrashing and swapping) is proven beneficial that it leads to the allocation of a hugepage. I'm not a fan of thrashing a zone pointlessly because we think the compaction freeing scanner may be able to allocate and the migration scanner may then suddenly free an entire pageblock. If that feedback loop can be provided, reclaim sounds worthwhile. Until then, we can't do such pointless work. > What is being asked of you is to consider introducing a new madvise hint and > having MADV_HUGEPAGE being about huge pages and introducing a new hint that > is hinting about locality without the strictness of memory binding. That > is significantly more tractable and you also presumably have access to > a reasonable reproduction cases even though I doubt you can release it. > We continue to talk about the role of __GFP_THISNODE. The Andrea patch is obviously premised on the idea that for some reason remote memory is much less fragmented or not under memory pressure. If the nodes are actually balanced, you've incurred a 40% fault latency absolutely and completely pointlessly. What I'm focusing on is the role of reclaim, not __GFP_THISNODE. Let's fix the issue being reported and look at the implementation of memory compaction to understand why it is happening: reclaim very very seldom can assist memory compaction in making a pageblock free under these conditions. It's not beneficial and should not be done. > > > Like Mel said, your app just happens to fit in a local node, if the > > > user of the lib is slightly different and allocates 16G on a system > > > where each node is 4G, the post-fix MADV_HUGEPAGE will perform > > > extremely better also for the lib user. > > > > > > > It won't if the remote memory is fragmented; the patch is premised on the > > argument that remote memory is never fragmented or under memory pressure > > otherwise it is multiplying the fault latency by the number of nodes. > > Sure, creating test cases where the local node is under heavy memory > > pressure yet remote nodes are mostly free will minimize the impact this > > has on fault latency. It still is a substantial increase, as I've > > measured on Haswell, but the access latency forever is the penalty. This > > patch cannot make access to remote memory faster. > > > > Indeed not but remote costs are highly platform dependant so decisions > made for haswell are not necessarily justified on EPYC or any later Intel > processor either. I'm reporting regressions for the Andrea patch on Haswell, Naples, and Rome. If you have numbers for any platform that suggests remote huge memory (and of course there's no consideration of NUMA distance here, so 2-hop would be allowed as well by this patch) is beneficial over local native pages, please present it. > Because these goalposts will keep changing, I think > it's better to have more clearly defined madvise hints and rely less on > implementation details that are subject to change. Otherwise we will get > stuck in an endless debate about "is workload X more important than Y"? > The clearly defined madvise is not helpful if it causes massive thrashing of *any* zone on the system and cannot enable memory compaction to directly make a hugepage available. In other words, show that I didn't just reclaim 60GB of memory that still causes compaction to fail to free a pageblock, which is precisely what happens today, either because slab fragmentation as the result of the memory pressure that we are under or because it is inaccessible to the freeing scanner. > I don't think it's necessarily bad but it cannot distinguish between > THP and hugetlbfs. Hugetlbfs users are typically more willing to accept > high overheads as they may be required for the application to function. > That's probably fixable but will still leave us in the state where > MADV_HUGEPAGE is also a hint about locality. It'd still be interesting > to hear if it fixes the VM initialisation issue but do note that if this > patch is used as a replacement that hugetlbfs users may complain down > the line. > Hugetlb memory is allocated round-robin over the set of allowed nodes with __GFP_THISNODE so it will already result in this thrashing problem that is being reported with the very unlikely possibility that reclaim has helped us allocate a hugepage. And this does distinguish between thp memory and hugetlb memory, thp is explicitly allocated here with __GFP_NORETRY, which controls the goto nopage, and hugetlb memory is allocated with __GFP_RETRY_MAYFAIL. I'd hate to add a gfp flag unnecessarily for this since we're talking about the role of reclaim in assisting memory compaction to make a high-order page available. That is clearly defined in my patch to be based on __GFP_NORETRY. I'd appreciate if Andrea can test this patch, have a rebuttal that we should still remove __GFP_THISNODE because we don't care about locality as much as forming a hugepage, we can make that change, and then merge this instead of causing such massive fault and access latencies.
On Mon, Oct 08, 2018 at 01:41:09PM -0700, David Rientjes wrote: > The page allocator is expecting __GFP_NORETRY for thp allocations per its > comment: > > /* > * Checks for costly allocations with __GFP_NORETRY, which > * includes THP page fault allocations > */ > if (costly_order && (gfp_mask & __GFP_NORETRY)) { > > And that enables us to check compact_result to determine whether thp > allocation should fail or continue to reclaim. I don't think it helps > that some thp fault allocations use __GFP_NORETRY and others do not. I > think that deserves a fix to alloc_hugepage_direct_gfpmask() or > GFP_TRANSHUGE_LIGHT. With your patch that changes how __GFP_NORETRY works for order = pageblock_order then this can help. However it solves nothing for large skbs and all other "costly_order" allocations. So if you think it's so bad to have a 40% increased allocation latency if all nodes are heavily fragmented for THP under MADV_HUGEPAGE (which are guaranteed to be long lived allocations or they wouldn't set MADV_HUGEPAGE in the first place), I'm not sure why you ignore the same exact overhead for all other "costly_order" allocations that would never set __GFP_THISNODE and will not even use GFP_TRANSHUGE_LIGHT so they may not set __GFP_NORETRY at all. I think it would be better to view the problem from a generic "costly_order" allocation prospective without so much "hardcoded" focus on THP MADV_HUGEPAGE only (which in fact are the least concern of all in terms of that 40% latency increase if all nodes are totally fragmented and compaction fails on all nodes, because they're long lived so the impact of the increased latency is more likely to be lost in the noise). > Our library that uses MADV_HUGEPAGE only invokes direct compaction because > we're on an older kernel: it does not attempt to do reclaim to make > compaction happy so that it finds memory that it can migrate memory to. > For reference, we use defrag setting of "defer+madvise". Everybody who > does not use MADV_HUGEPAGE kicks off kswapd/kcompactd and fails, > MADV_HUGEPAGE users do the same but also try direct compaction. That > direct compaction uses a mode of MIGRATE_ASYNC so it normally fails > because of need_resched() or spinlock contention. > These allocations always fail, MADV_HUGEPAGE or otherwise, without > invoking direct reclaim. Even older kernels (before "defer+madvise" option was available) always invoked reclaim if COMPACT_SKIPPED was returned. The COMPACT_SKIPPED behavior I referred that explains why __GFP_NORETRY doesn't prevent reclaim to be invoked, is nothing new, it always worked that way from the day compaction was introduced. So it's fairly strange that your kernel doesn't call reclaim at all if COMPACT_SKIPPED is returned. > I am agreeing with both you and Mel that it makes no sense to thrash the > local node to make compaction happy and then hugepage-order memory > available. I'm only differing with you on the mechanism to fail early: we > never want to do attempt reclaim on thp allocations specifically because > it leads to the behavior you are addressing. Not sure I can agree with the above. If all nodes (not only the local one) are already below the watermarks and compaction returns COMPACT_SKIPPED, there is zero global free memory available, we would need to swap anyway to succeed the 2M allocation. So it's better to reclaim 2M from on node and then retry compaction again on the same node if compaction is failing on the node because of COMPACT_SKIPPED and the global free memory is zero. If it swapouts it would have swapped out anyway but this way THP will be returned. It's just __GFP_THISNODE that broke the above logic. __GFP_THISNODE may just be safe to use only if the global amount of free memory (in all nodes) is zero (or below HPAGE_PMD_SIZE generally speaking). > My contention is that removing __GFP_THISNODE papers over the problem, > especially in cases where remote memory is also fragmnented. It incurs a > much higher (40%) fault latency and then incurs 13.9% greater access > latency. It is not a good result, at least for Haswell, Naples, and Rome. > > To make a case that we should fault hugepages remotely as fallback, either > for non-MADV_HUGEPAGE users who do not use direct compaction, or > MADV_HUGEPAGE users who use direct compaction, we need numbers that > suggest there is a performance benefit in terms of access latency to > suggest that it is better; this is especially the case when the fault > latency is 40% higher. On Haswell, Naples, and Rome, it is quite obvious > that this patch works much harder to fault memory remotely that incurs a > substantial performance penalty when it fails and in the cases where it > succeeds we have much worse access latency. What we're fixing is a effectively a breakage and insecure behavior too and that must be fixed first. A heavily multithreaded processes with its memory not fitting in a single NUMA node is too common of a workload to misbehave so bad as it did. The above issue is a much smaller concern and it's not a breakage nor insecure. __GFP_THISNODE semantics are not enough to express what you need there. Like Mel said we need two tests, one that represents the "pathological THP allocation behavior" which is trivial (any heavy multithreaded app accessing all memory of all nodes from all CPUs will do). And one that represents your workload. By thinking about it, it's almost impossible that eliminating 100% of swapouts from the multithreaded app runtime, will not bring a benefit that is orders of magnitude bigger than the elimination of remote compaction and having more local THP pages for your app. It's just on a different scale and the old behavior was insecure to begin with because it allowed a process to heavily penalize the runtime of other apps running with different uid, without requiring any privilege. > For users who bind their applications to a subset of cores on a NUMA node, > fallback is egregious: we are guaranteed to never have local access > latency and in the case when the local node is fragmented and remote > memory is not our fault latency goes through the roof when local native > pages would have been much better. > > I've brought numbers on how poor this patch performs, so I'm asking for a > rebuttal that suggests it is better on some platforms. (1) On what > platforms is it better to fault remote hugepages over local native pages? > (2) What is the fault latency increase when remote nodes are fragmented as > well? That I don't think is the primary issue, the issue here is the one of the previous paragrah. Then there may be workloads and/or platforms where allocating remote THP instead of local PAGE_SIZEd memory is better regardless and __GFP_THISNODE hurted twice (but we could ignore those for now, and just focus where __GFP_THISNODE provided a benefit like for your app using the lib). > > Like Mel said, your app just happens to fit in a local node, if the > > user of the lib is slightly different and allocates 16G on a system > > where each node is 4G, the post-fix MADV_HUGEPAGE will perform > > extremely better also for the lib user. > > > > It won't if the remote memory is fragmented; the patch is premised on the How could compaction run slower than heavy swapping gigabytes of memory? > argument that remote memory is never fragmented or under memory pressure > otherwise it is multiplying the fault latency by the number of nodes. > Sure, creating test cases where the local node is under heavy memory > pressure yet remote nodes are mostly free will minimize the impact this > has on fault latency. It still is a substantial increase, as I've > measured on Haswell, but the access latency forever is the penalty. This > patch cannot make access to remote memory faster. I still doubt you measured the performance of the app using the lib when the app requires more 4 times the size of the local NUMA node under "taskset -c 0". If you do that, the app using the lib shall fall into the "pathological THP allocation behavior" too. As long as the spillover is 10-20% perhaps it's not as bad and if it's not heavily multithreaded doing scattered random access like a virtual machine will do, NUMA balancing will hide the issue by moving the thread to CPU of the NUMA node where the 10-20% spillover happened (which is why I asked "taskset -c 0" to simulate a multithreaded app accessing all memory in a scattered way without locality. The multithreaded app with scattered access is not unreasonable workload and it must perform well: all workloads using MPOL_INTERLEAVE would fit the criteria, not just qemu. > > And you know, if the lib user fits in one node, it can use mbind and > > it won't hit OOM... and you'd need some capability giving the app > > privilege anyway to keep MADV_HUGEPAGE as deep and unfair to the rest > > of the processes running the local node (like mbind and mlock require > > too). > > > > No, the lib user does not fit into a single node, we have cases where > these nodes are under constant memory pressure. We are on an older > kernel, however, that fails because of need_resched() and spinlock > contention rather than falling back to local reclaim. It's unclear which need_resched() and spinlock contention induces this failure. > > Did you try the __GFP_COMPACT_ONLY patch? That won't have the 40% > > fault latency already. > > > > I remember Kirill saying that he preferred node local memory allocation > over thp allocation in the review, which I agree with, but it was much > better than this patch. I see no reason why we should work so hard to > reclaim memory, thrash local memory, and swap so that the compaction > freeing scanner can find memory when there is *no* guarantee that the > migration scanner can free an entire pageblock. That is completely > pointless work, even for an MADV_HUGEPAGE user, and we certainly don't > have such pathological behavior on older kernels. The compaction scanner is succeeding at migrating the entire pageblock every time. The local node isn't so fragmented, it's just full of anonymous memory that is getting swapped out at every further invocation of reclaim. If it was fragmented __GFP_NORETRY would have had an effect because compaction would have been deferred. > Completely agreed, and I'm very strongly suggesting that we do not invoke > reclaim. Memory compaction provides no guarantee that it can free an > entire pageblock, even for MIGRATE_ASYNC compaction under MADV_HUGEPAGE it > will attempt to migrate SWAP_CLUSTER_MAX pages from a pageblock without > considering if the entire pageblock could eventually become freed for > order-9 memory. The vast majority of the time this reclaim could be > completely pointless. We cannot use it. If we don't invoke reclaim __GFP_THISNODE is fine to keep, however that means direction compaction doesn't work at all if all nodes are full of filesystem cache. This was a drawback of __GFP_COMPACT_ONLY too. A minor one compared to the corner case created by __GFP_THISNODE. > We don't need __GFP_NORETRY anymore with this, it wouldn't be useful to > continually compact memory in isolation if it already has failed. It > seems strange to be using __GFP_DIRECT_RECLAIM | __GFP_ONLY_COMPACT, > though, as I assume this would evolve into. > > Are there any other proposed users for __GFP_ONLY_COMPACT beyond thp > allocations? If not, we should just save the gfp bit and encode the logic > directly into the page allocator. Yes that's seems equivalent, just it embeds __GFP_COMPACT_ONLY in __GFP_NORETRY and ends up setting __GFP_NORETRY for everything, __GFP_COMPACT_ONLY didn't alter the __GFP_NORETRY semantics instead. > Would you support this? I didn't consider it a problem the 40% increased allocation latency, I thought __GFP_COMPACT_ONLY or the other fix in -mm was just about the remote THP vs local PAGE_SIZEd memory tradeoff (considering this logic was already hardcoded only for THP so I tried to keep it). Now that you raise this higher compaction latency as a big regression in not fixing the VM with __GFP_COMPACT_ONLY semantics in __GFP_NORETRY, the next question is why you don't care at all about such increased latency for all other "costly_order" allocations that won't use __GFP_THISNODE and where your altered __GFP_NORETRY won't help to skip compaction on the remote nodes. In other words if such compaction invocation in fully fragmented remote nodes is such a big concern for THP, why is not an issue at all for all other "costly_order" allocations? > + if (order == pageblock_order && > + !(current->flags & PF_KTHREAD)) > + goto nopage; For a more reliable THP hardcoding HPAGE_PMD_ORDER would be better than pageblock_order (I guess especially for non-x86 archs where the match between hugetlbfs size and thp size is less obvious), but then the hardcoding itself now looks the worst part of this patch if the concern is the 40% increased allocation latency (and not only the remote compound page vs local 4k page). Thanks, Andrea
On Tue, Oct 09, 2018 at 03:17:30PM -0700, David Rientjes wrote: > causes workloads to severely regress both in fault and access latency when > we know that direct reclaim is unlikely to make direct compaction free an > entire pageblock. It's more likely than not that the reclaim was > pointless and the allocation will still fail. How do you know that? If all RAM is full of filesystem cache, but it's not heavily fragmented by slab or other unmovable objects, compaction will succeed every single time after reclaim frees 2M of cache like it's asked to do. reclaim succeeds every time, compaction then succeeds every time. Not doing reclaim after COMPACT_SKIPPED is returned simply makes compaction unable to compact memory once all nodes are filled by filesystem cache. Certainly it's better not to invoke reclaim at all if __GFP_THISNODE is set, than swapping out heavy over the local node. Doing so however has the drawback of reducing the direct compaction effectiveness. I don't think it's true that reclaim is generally "pointless", it's just that invoking any reclaim backfired so bad if __GFP_THISNODE was set, than anything else (including weakining compaction effectiveness) was better. > If memory compaction were patched such that it can report that it could > successfully free a page of the specified order if there were free pages > at the end of the zone it could migrate to, reclaim might be helpful. But > with the current implementation, I don't think that is reliably possible. > These free pages could easily be skipped over by the migration scanner > because of the presence of slab pages, for example, and unavailable to the > freeing scanner. Yes there's one case where reclaim is "pointless", but it happens once and then COMPACT_DEFERRED is returned and __GFP_NORETRY will skip reclaim then. So you're right when we hit fragmentation there's one and only one "pointless" reclaim invocation. And immediately after we also exponentially backoff on the compaction invocations with the compaction deferred logic. We could try optimize away such "pointless" reclaim event for sure, but it's probably an optimization that may just get lost in the noise and may not be measurable, because it only happens once when the first full fragmentation is encountered. > I really think that for this patch to be merged over my proposed change > that it needs to be clearly demonstrated that reclaim was successful in > that it freed memory that was subsequently available to the compaction > freeing scanner and that enabled entire pageblocks to become free. That, > in my experience, will very very seldom be successful because of internal > slab fragmentation, compaction_alloc() cannot soak up pages from the > reclaimed memory, and potentially thrash the zone completely pointlessly. > The last point is the problem being reported here, but the other two are > as legitimate. I think the demonstration can already be inferred, because if hit full memory fragmentation after every reclaim, __GFP_NORETRY would have solved the "pathological THP allocation behavior" without requiring your change to __GFP_NORETRY that makes it behave like __GFP_COMPACT_ONLY for order == HPAGE_PMD_ORDER. Anyway you can add a few statistic counters and verify in more accurate way how often a COMPACT_SKIPPED + reclaim cycle is followed by a COMPACT_DEFERRED. > I'd appreciate if Andrea can test this patch, have a rebuttal that we > should still remove __GFP_THISNODE because we don't care about locality as > much as forming a hugepage, we can make that change, and then merge this > instead of causing such massive fault and access latencies. I can certainly test, but from source review I'm already convinced it'll solve fine the "pathological THP allocation behavior", no argument about that. It's certainly better and more correct your patch than the current upstream (no security issues with lack of permissions for __GFP_THISNODE anymore either). I expect your patch will run 100% equivalent to __GFP_COMPACT_ONLY alternative I posted, for our testcase that hit into the "pathological THP allocation behavior". Your patch encodes __GFP_COMPACT_ONLY into the __GFP_NORETRY semantics and hardcodes the __GFP_COMPACT_ONLY for all orders = HPAGE_PMD_SIZE no matter which is the caller. As opposed I let the caller choose and left __GFP_NORETRY semantics alone and orthogonal to the __GFP_COMPACT_ONLY semantics. I think letting the caller decide instead of hardcoding it for order 9 is better, because __GFP_COMPACT_ONLY made sense to be set only if __GFP_THISNODE was also set by the caller. If a driver does an order 9 allocation with __GFP_THISNODE not set, your patch will prevent it to allocate remote THP if all remote nodes are full of cache (which is a reasonable common assumption as more THP are allocated over time eating in all free memory). My patch didn't alter that so I tend to prefer the __GFP_COMPACT_ONLY than the hardcoding __GFP_COMPACT_ONLY for all order 9 allocations regardless if __GFP_THISNODE is set or not. Last but not the last, from another point of view, I thought calling remote compaction was a feature especially for MADV_HUGEPAGE. However if avoiding the 40% increase latency for MADV_HUGEPAGE was the primary motivation for __GFP_THISNODE, not sure how we can say that such an high allocation latency is only a concern for order 9 allocations and all other costly_order allocations (potentially with orders even higher than HPAGE_PMD_ORDER in fact) are ok to keep calling remote compaction and incur in the 40% higher latency. Thanks, Andrea
On Tue, Oct 09, 2018 at 04:25:10PM +0200, Michal Hocko wrote: > On Tue 09-10-18 14:00:34, Mel Gorman wrote: > > On Tue, Oct 09, 2018 at 02:27:45PM +0200, Michal Hocko wrote: > > > [Sorry for being slow in responding but I was mostly offline last few > > > days] > > > > > > On Tue 09-10-18 10:48:25, Mel Gorman wrote: > > > [...] > > > > This goes back to my point that the MADV_HUGEPAGE hint should not make > > > > promises about locality and that introducing MADV_LOCAL for specialised > > > > libraries may be more appropriate with the initial semantic being how it > > > > treats MADV_HUGEPAGE regions. > > > > > > I agree with your other points and not going to repeat them. I am not > > > sure madvise s the best API for the purpose though. We are talking about > > > memory policy here and there is an existing api for that so I would > > > _prefer_ to reuse it for this purpose. > > > > > > > I flip-flopped on that one in my head multiple times on the basis of > > how strict it should be. Memory policies tend to be black or white -- > > bind here, interleave there, etc. It wasn't clear to me what the best > > policy would be to describe "allocate local as best as you can but allow > > fallbacks if necessary". MPOL_PREFERRED is not black and white. In fact I asked David earlier if MPOL_PREFERRED could check if it would already be a good fit for this. Still the point is it requires privilege (and for a good reason). > I was thinking about MPOL_NODE_PROXIMITY with the following semantic: > - try hard to allocate from a local or very close numa node(s) even when > that requires expensive operations like the memory reclaim/compaction > before falling back to other more distant numa nodes. If MPOL_PREFERRED can't work something like this could be added. I think "madvise vs mbind" is more an issue of "no-permission vs permission" required. And if the processes ends up swapping out all other process with their memory already allocated in the node, I think some permission is correct to be required, in which case an mbind looks a better fit. MPOL_PREFERRED also looks a first candidate for investigation as it's already not black and white and allows spillover and may already do the right thing in fact if set on top of MADV_HUGEPAGE. Thanks, Andrea
On 10/10/18 12:51 AM, Andrea Arcangeli wrote: > Yes there's one case where reclaim is "pointless", but it happens once > and then COMPACT_DEFERRED is returned and __GFP_NORETRY will skip > reclaim then. > > So you're right when we hit fragmentation there's one and only one > "pointless" reclaim invocation. And immediately after we also > exponentially backoff on the compaction invocations with the > compaction deferred logic. > > We could try optimize away such "pointless" reclaim event for sure, > but it's probably an optimization that may just get lost in the noise > and may not be measurable, because it only happens once when the first > full fragmentation is encountered. Note there's a small catch in the above. defer_compaction() has always only been called after a failure on higher priority than COMPACT_PRIO_ASYNC, where it's assumed that async compaction can terminate prematurely due to a number of reasons, so it doesn't mean that the zone itself cannot be compacted. And, for __GFP_NORETRY, if the initial compaction fails, we keep using async compaction also for the second, after-reclaim attempt (which would otherwise use SYNC_LIGHT): /* * Looks like reclaim/compaction is worth trying, but * sync compaction could be very expensive, so keep * using async compaction. */ compact_priority = INIT_COMPACT_PRIORITY; This doesn't affect current madvised THP allocation which doesn't use __GFP_NORETRY, but could explain why you saw no benefit from changing it to __GFP_NORETRY.
On Tue, 9 Oct 2018, Andrea Arcangeli wrote: > On Tue, Oct 09, 2018 at 03:17:30PM -0700, David Rientjes wrote: > > causes workloads to severely regress both in fault and access latency when > > we know that direct reclaim is unlikely to make direct compaction free an > > entire pageblock. It's more likely than not that the reclaim was > > pointless and the allocation will still fail. > > How do you know that? If all RAM is full of filesystem cache, but it's > not heavily fragmented by slab or other unmovable objects, compaction > will succeed every single time after reclaim frees 2M of cache like > it's asked to do. > > reclaim succeeds every time, compaction then succeeds every time. > > Not doing reclaim after COMPACT_SKIPPED is returned simply makes > compaction unable to compact memory once all nodes are filled by > filesystem cache. > For reclaim to assist memory compaction based on compaction's current implementation, it would require that the freeing scanner starting at the end of the zone can find these reclaimed pages as migration targets and that compaction will be able to utilize these migration targets to make an entire pageblock free. In such low on memory conditions when a node is fully saturated, it is much less likely that memory compaction can free an entire pageblock even if the freeing scanner can find these now-free pages. More likely is that we have unmovable pages in MIGRATE_MOVABLE pageblocks because the allocator allows fallback to pageblocks of other migratetypes to return node local memory (because affinity matters for kernel memory as it matters for thp) rather than fallback to remote memory. This has caused us significant pain where we have 1.5GB of slab, for example, spread over 100GB of pageblocks once the node has become saturated. So reclaim is not always "pointless" as you point out, but it should at least only be attempted if memory compaction could free an entire pageblock if it had free memory to migrate to. It's much harder to make sure that these freed pages can be utilized by the freeing scanner. Based on how memory compaction is implemented, I do not think any guarantee can be made that reclaim will ever be successful in allowing it to make order-9 memory available, unfortunately. > > If memory compaction were patched such that it can report that it could > > successfully free a page of the specified order if there were free pages > > at the end of the zone it could migrate to, reclaim might be helpful. But > > with the current implementation, I don't think that is reliably possible. > > These free pages could easily be skipped over by the migration scanner > > because of the presence of slab pages, for example, and unavailable to the > > freeing scanner. > > Yes there's one case where reclaim is "pointless", but it happens once > and then COMPACT_DEFERRED is returned and __GFP_NORETRY will skip > reclaim then. > > So you're right when we hit fragmentation there's one and only one > "pointless" reclaim invocation. And immediately after we also > exponentially backoff on the compaction invocations with the > compaction deferred logic. > This assumes that every time we get COMPACT_SKIPPED that if we can simply free memory that it'll succeed and that's definitely not the case based on the implementation of memory compaction: compaction_alloc() needs to find the memory and the migration scanner needs to free an entire pageblock. The migration scanner doesn't even look ahead to see if that's possible before starting to migrate pages, it's limited to COMPACT_CLUSTER_MAX. The scenario we have: compaction returns COMPACT_SKIPPED; reclaim expensively tries to reclaim memory by thrashing the local node; the compaction migration scanner has already passed over the now-freed pages so it's inaccessible; if accessible, the migration scanner migrates memory to the newly freed pages but fails to make a pageblock free; loop. My contention is that the second step is only justified if we can guarantee the freed memory can be useful for compaction and that it can free an entire pageblock for the hugepage if it can migrate. Both of those are not possible to determine based on the current implementation. > > I'd appreciate if Andrea can test this patch, have a rebuttal that we > > should still remove __GFP_THISNODE because we don't care about locality as > > much as forming a hugepage, we can make that change, and then merge this > > instead of causing such massive fault and access latencies. > > I can certainly test, but from source review I'm already convinced > it'll solve fine the "pathological THP allocation behavior", no > argument about that. It's certainly better and more correct your patch > than the current upstream (no security issues with lack of permissions > for __GFP_THISNODE anymore either). > > I expect your patch will run 100% equivalent to __GFP_COMPACT_ONLY > alternative I posted, for our testcase that hit into the "pathological > THP allocation behavior". > > Your patch encodes __GFP_COMPACT_ONLY into the __GFP_NORETRY semantics > and hardcodes the __GFP_COMPACT_ONLY for all orders = HPAGE_PMD_SIZE > no matter which is the caller. > > As opposed I let the caller choose and left __GFP_NORETRY semantics > alone and orthogonal to the __GFP_COMPACT_ONLY semantics. I think > letting the caller decide instead of hardcoding it for order 9 is > better, because __GFP_COMPACT_ONLY made sense to be set only if > __GFP_THISNODE was also set by the caller. > I've hardcoded it directly for pageblock_order because compaction works over pageblocks and we lack the two crucial points of information I've stated that determines whether direct reclaim could possibly be useful. (It's more correctly implemented as order >= pageblock_order as opposed to order == pageblock_order.) > If a driver does an order 9 allocation with __GFP_THISNODE not set, > your patch will prevent it to allocate remote THP if all remote nodes > are full of cache (which is a reasonable common assumption as more THP > are allocated over time eating in all free memory). Iff it's using __GFP_NORETRY, yes, the allocation and remote access latency that is incurred is the same as for thp. Hugetlbfs doesn't use __GFP_NORETRY, it uses __GFP_RETRY_MAYFAIL, so it will attempt to reclaim memory after my patch. My patch is assuming that a single call to reclaim followed up by another attempt at compaction is not beneficial for pageblock_order sized allocations with __GFP_NORETRY because it's unlikely to either free an entire pageblock and compaction may not be able to access this memory. It's based on how memory compaction is implemented rather than any special heuristic. I won't argue if you protect this logic with __GFP_COMPACT_ONLY, but I think thp allocations should always have __GFP_NORETRY based on compaction's implementation. I see removing __GFP_THISNODE as a separate discussion: if, after my patch (perhaps with a modification for __GFP_COMPACT_ONLY on top of it), you still get unacceptable fault latency and can show that remote access latency to remotely allocated hugepage is better on some platform that isn't Haswell, Naples, and Rome, we can address that but it will probably require more work that simply unsetting __GFP_THISNODE because it will depend on the latency to certain remote nodes over others.
On Tue, 9 Oct 2018, Andrea Arcangeli wrote: > I think "madvise vs mbind" is more an issue of "no-permission vs > permission" required. And if the processes ends up swapping out all > other process with their memory already allocated in the node, I think > some permission is correct to be required, in which case an mbind > looks a better fit. MPOL_PREFERRED also looks a first candidate for > investigation as it's already not black and white and allows spillover > and may already do the right thing in fact if set on top of > MADV_HUGEPAGE. > We would never want to thrash the local node for hugepages because there is no guarantee that any swapping is useful. On COMPACT_SKIPPED due to low memory, we have very clear evidence that pageblocks are already sufficiently fragmented by unmovable pages such that compaction itself, even with abundant free memory, fails to free an entire pageblock due to the allocator's preference to fragment pageblocks of fallback migratetypes over returning remote free memory. As I've stated, we do not want to reclaim pointlessly when compaction is unable to access the freed memory or there is no guarantee it can free an entire pageblock. Doing so allows thrashing of the local node, or remote nodes if __GFP_THISNODE is removed, and the hugepage still cannot be allocated. If this proposed mbind() that requires permissions is geared to me as the user, I'm afraid the details of what leads to the thrashing are not well understood because I certainly would never use this.
On Wed, 10 Oct 2018, David Rientjes wrote: > > I think "madvise vs mbind" is more an issue of "no-permission vs > > permission" required. And if the processes ends up swapping out all > > other process with their memory already allocated in the node, I think > > some permission is correct to be required, in which case an mbind > > looks a better fit. MPOL_PREFERRED also looks a first candidate for > > investigation as it's already not black and white and allows spillover > > and may already do the right thing in fact if set on top of > > MADV_HUGEPAGE. > > > > We would never want to thrash the local node for hugepages because there > is no guarantee that any swapping is useful. On COMPACT_SKIPPED due to > low memory, we have very clear evidence that pageblocks are already > sufficiently fragmented by unmovable pages such that compaction itself, > even with abundant free memory, fails to free an entire pageblock due to > the allocator's preference to fragment pageblocks of fallback migratetypes > over returning remote free memory. > > As I've stated, we do not want to reclaim pointlessly when compaction is > unable to access the freed memory or there is no guarantee it can free an > entire pageblock. Doing so allows thrashing of the local node, or remote > nodes if __GFP_THISNODE is removed, and the hugepage still cannot be > allocated. If this proposed mbind() that requires permissions is geared > to me as the user, I'm afraid the details of what leads to the thrashing > are not well understood because I certainly would never use this. > At the risk of beating a dead horse that has already been beaten, what are the plans for this patch when the merge window opens? It would be rather unfortunate for us to start incurring a 14% increase in access latency and 40% increase in fault latency. Would it be possible to test with my patch[*] that does not try reclaim to address the thrashing issue? If that is satisfactory, I don't have a strong preference if it is done with a hardcoded pageblock_order and __GFP_NORETRY check or a new __GFP_COMPACT_ONLY flag. I think the second issue of faulting remote thp by removing __GFP_THISNODE needs supporting evidence that shows some platforms benefit from this (and not with numa=fake on the command line :). [*] https://marc.info/?l=linux-kernel&m=153903127717471
On Mon, 15 Oct 2018 15:30:17 -0700 (PDT) David Rientjes <rientjes@google.com> wrote: > At the risk of beating a dead horse that has already been beaten, what are > the plans for this patch when the merge window opens? I'll hold onto it until we've settled on something. Worst case, Andrea's original is easily backportable. > It would be rather > unfortunate for us to start incurring a 14% increase in access latency and > 40% increase in fault latency. Yes. > Would it be possible to test with my > patch[*] that does not try reclaim to address the thrashing issue? Yes please. And have you been able to test it with the sort of workloads which Andrea is attempting to address?
On Mon, Oct 15, 2018 at 03:30:17PM -0700, David Rientjes wrote: > At the risk of beating a dead horse that has already been beaten, what are > the plans for this patch when the merge window opens? It would be rather > unfortunate for us to start incurring a 14% increase in access latency and > 40% increase in fault latency. Would it be possible to test with my > patch[*] that does not try reclaim to address the thrashing issue? If > that is satisfactory, I don't have a strong preference if it is done with > a hardcoded pageblock_order and __GFP_NORETRY check or a new > __GFP_COMPACT_ONLY flag. I don't like the pageblock size hardcoding inside the page allocator. __GFP_COMPACT_ONLY is fully runtime equivalent, but it at least let the caller choose the behavior, so it looks more flexible. To fix your 40% fault latency concern in the short term we could use __GFP_COMPACT_ONLY, but I think we should get rid of __GFP_COMPACT_ONLY later: we need more statistical data in the zone structure to track remote compaction failures triggering because the zone is fully fragmented. Once the zone is fully fragmented we need to do a special exponential backoff on that zone when the zone is from a remote node. Furthermore at the first remote NUMA node zone failure caused by full fragmentation we need to interrupt compaction and stop trying with all remote nodes. As long as compaction returns COMPACT_SKIPPED it's ok to keep doing reclaim and keep doing compaction, as long as compaction succeeds. What is causing the higher latency is the fact we try all zones from all remote nodes even if there's a failure caused by full fragmentation of all remote zone, and we're unable to skip (skip with exponential backoff) only those zones where compaction cannot succeed because of fragmentation. Once we achieve the above deleting __GFP_COMPACT_ONLY will be a trivial patch. > I think the second issue of faulting remote thp by removing __GFP_THISNODE > needs supporting evidence that shows some platforms benefit from this (and > not with numa=fake on the command line :). > > [*] https://marc.info/?l=linux-kernel&m=153903127717471 That is needed to compare the current one liner fix with __GFP_COMPACT_ONLY, but I don't think it's needed to compare v4.18 with the current fix. The badness of v4.18 was too bad keep, getting local PAGE_SIZEd memory or remote THPs is a secondary issue. In fact the main reason for __GFP_COMPACT_ONLY is not anymore such tradeoff, but not to spend too much CPU in compaction when all nodes are fragmented to avoid increasing the allocation latency too much. Thanks, Andrea
Hello Andrew, On Mon, Oct 15, 2018 at 03:44:59PM -0700, Andrew Morton wrote: > On Mon, 15 Oct 2018 15:30:17 -0700 (PDT) David Rientjes <rientjes@google.com> wrote: > > Would it be possible to test with my > > patch[*] that does not try reclaim to address the thrashing issue? > > Yes please. It'd also be great if a testcase reproducing the 40% higher access latency (with the one liner original fix) was available. We don't have a testcase for David's 40% latency increase problem, but that's likely to only happen when the system is somewhat low on memory globally. So the measurement must be done when compaction starts failing globally on all zones, but before the system starts swapping. The more global fragmentation the larger will be the window between "compaction fails because all zones are too fragmented" and "there is still free PAGE_SIZEd memory available to reclaim without swapping it out". If I understood correctly, that is precisely the window where the 40% higher latency should materialize. The workload that shows the badness in the upstream code is fairly trivial. Mel and Zi reproduced it too and I have two testcases that can reproduce it, one with device assignment and the other is just memhog. That's a massively larger window than the one where the 40% higher latency materializes. When there's 75% or more of the RAM free (not even allocated as easily reclaimable pagecache) globally, you don't expect to hit heavy swapping. The 40% THP allocation latency increase if you use MADV_HUGEPAGE in such window where all remote zones are fully fragmented is somehow lesser of a concern in my view (plus there's the compact deferred logic that should mitigate that scenario). Furthermore it is only a concern for page faults in MADV_HUGEPAGE ranges. If MADV_HUGEPAGE is set the userland allocation is long lived, so such higher allocation latency won't risk to hit short lived allocations that don't set MADV_HUGEPAGE (unless madvise=always, but that's not the default precisely because not all allocations are long lived). If the MADV_HUGEPAGE using library was freely available it'd also be nice.
On Mon, Oct 15, 2018 at 03:44:59PM -0700, Andrew Morton wrote: > On Mon, 15 Oct 2018 15:30:17 -0700 (PDT) David Rientjes <rientjes@google.com> wrote: > > > At the risk of beating a dead horse that has already been beaten, what are > > the plans for this patch when the merge window opens? > > I'll hold onto it until we've settled on something. Worst case, > Andrea's original is easily backportable. > I consider this to be an unfortunate outcome. On the one hand, we have a problem that three people can trivially reproduce with known test cases and a patch shown to resolve the problem. Two of those three people work on distributions that are exposed to a large number of users. On the other, we have a problem that requires the system to be in a specific state and an unknown workload that suffers badly from the remote access penalties with a patch that has review concerns and has not been proven to resolve the trivial cases. In the case of distributions, the first patch addresses concerns with a common workload where on the other hand we have an internal workload of a single company that is affected -- which indirectly affects many users admittedly but only one entity directly. At the absolute minimum, a test case for the "system fragmentation incurs access penalties for a workload" scenario that could both replicate the fragmentation and demonstrate the problem should have been available before the patch was rejected. With the test case, there would be a chance that others could analyse the problem and prototype some fixes. The test case was requested in the thread and never produced so even if someone were to prototype fixes, it would be dependant on a third party to test and produce data which is a time-consuming loop. Instead, we are more or less in limbo.
On Tue, 16 Oct 2018 08:46:06 +0100 Mel Gorman <mgorman@suse.de> wrote: > On Mon, Oct 15, 2018 at 03:44:59PM -0700, Andrew Morton wrote: > > On Mon, 15 Oct 2018 15:30:17 -0700 (PDT) David Rientjes <rientjes@google.com> wrote: > > > > > At the risk of beating a dead horse that has already been beaten, what are > > > the plans for this patch when the merge window opens? > > > > I'll hold onto it until we've settled on something. Worst case, > > Andrea's original is easily backportable. > > > > I consider this to be an unfortunate outcome. On the one hand, we have a > problem that three people can trivially reproduce with known test cases > and a patch shown to resolve the problem. Two of those three people work > on distributions that are exposed to a large number of users. On the > other, we have a problem that requires the system to be in a specific > state and an unknown workload that suffers badly from the remote access > penalties with a patch that has review concerns and has not been proven > to resolve the trivial cases. In the case of distributions, the first > patch addresses concerns with a common workload where on the other hand > we have an internal workload of a single company that is affected -- > which indirectly affects many users admittedly but only one entity directly. > > At the absolute minimum, a test case for the "system fragmentation incurs > access penalties for a workload" scenario that could both replicate the > fragmentation and demonstrate the problem should have been available before > the patch was rejected. With the test case, there would be a chance that > others could analyse the problem and prototype some fixes. The test case > was requested in the thread and never produced so even if someone were to > prototype fixes, it would be dependant on a third party to test and produce > data which is a time-consuming loop. Instead, we are more or less in limbo. > OK, thanks. But we're OK holding off for a few weeks, yes? If we do that we'll still make it into 4.19.1. Am reluctant to merge this while discussion, testing and possibly more development are ongoing.
Hello, On Tue, Oct 16, 2018 at 03:37:15PM -0700, Andrew Morton wrote: > we'll still make it into 4.19.1. Am reluctant to merge this while > discussion, testing and possibly more development are ongoing. I think there can be definitely more developments primarily to make the compact deferred logic NUMA aware. Instead of a global deferred logic, we should split it per zone per node so that it backs off exponentially with an higher cap in remote nodes. The current global "backoff" limit will still apply to the "local" zone compaction. Who would like to work on that? However I don't think it's worth waiting for that, because it's not a trivial change. Certainly we can't ship upstream in production with this bug, so if it doesn't get fixed upstream we'll fix it downstream first until the more developments are production ready. This was a severe regression compared to previous kernels that made important workloads unusable and it starts when __GFP_THISNODE was added to THP allocations under MADV_HUGEPAGE. It is not a significant risk to go to the previous behavior before __GFP_THISNODE was added, it worked like that for years. This was simply an optimization to some lucky workloads that can fit in a single node, but it ended up breaking the VM for others that can't possibly fit in a single node, so going back is safe. Thanks, Andrea
On Tue, 16 Oct 2018 19:11:49 -0400 Andrea Arcangeli <aarcange@redhat.com> wrote: > This was a severe regression > compared to previous kernels that made important workloads unusable > and it starts when __GFP_THISNODE was added to THP allocations under > MADV_HUGEPAGE. It is not a significant risk to go to the previous > behavior before __GFP_THISNODE was added, it worked like that for > years. 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node") was April 2015. That's a long time for a "severe regression" to go unnoticed?
On Tue 16-10-18 16:16:43, Andrew Morton wrote: > On Tue, 16 Oct 2018 19:11:49 -0400 Andrea Arcangeli <aarcange@redhat.com> wrote: > > > This was a severe regression > > compared to previous kernels that made important workloads unusable > > and it starts when __GFP_THISNODE was added to THP allocations under > > MADV_HUGEPAGE. It is not a significant risk to go to the previous > > behavior before __GFP_THISNODE was added, it worked like that for > > years.@s1@s2@s1 > > 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to > local node") was April 2015. That's a long time for a "severe > regression" to go unnoticed? Well, it gets some time to adopt changes in enterprise and we start seeing people reporting this issue. That is why I believe we should start with something really simple and stable tree backportable first and then build something more complex on top.
On Tue, Oct 16, 2018 at 03:37:15PM -0700, Andrew Morton wrote: > On Tue, 16 Oct 2018 08:46:06 +0100 Mel Gorman <mgorman@suse.de> wrote: > > I consider this to be an unfortunate outcome. On the one hand, we have a > > problem that three people can trivially reproduce with known test cases > > and a patch shown to resolve the problem. Two of those three people work > > on distributions that are exposed to a large number of users. On the > > other, we have a problem that requires the system to be in a specific > > state and an unknown workload that suffers badly from the remote access > > penalties with a patch that has review concerns and has not been proven > > to resolve the trivial cases. In the case of distributions, the first > > patch addresses concerns with a common workload where on the other hand > > we have an internal workload of a single company that is affected -- > > which indirectly affects many users admittedly but only one entity directly. > > > > At the absolute minimum, a test case for the "system fragmentation incurs > > access penalties for a workload" scenario that could both replicate the > > fragmentation and demonstrate the problem should have been available before > > the patch was rejected. With the test case, there would be a chance that > > others could analyse the problem and prototype some fixes. The test case > > was requested in the thread and never produced so even if someone were to > > prototype fixes, it would be dependant on a third party to test and produce > > data which is a time-consuming loop. Instead, we are more or less in limbo. > > > > OK, thanks. > > But we're OK holding off for a few weeks, yes? If we do that > we'll still make it into 4.19.1. Am reluctant to merge this while > discussion, testing and possibly more development are ongoing. > Without a test case that reproduces the Google case, we are a bit stuck. Previous experience indicates that just fragmenting memory is not enough to give a reliable case as unless the unmovable/reclaimable pages are "sticky", the normal reclaim can handle it. Similarly, the access pattern of the target workload is important as it would need to be something larger than L3 cache to constantly hit the access penalty. We do not know what the exact characteristics of the Google workload are but we know that a fix for three cases is not equivalent for the Google case. The discussion has circled around wish-list items such as better fragmentation control, node-aware compaction, improved compact deferred logic and lower latencies with little in the way of actual specifics of implementation or patches. Improving fragmentation control would benefit from a workload that actually fragments so the extfrag events can be monitored as well as maybe a dump of pageblocks with mixed pages. On node-aware compaction, that was not implemented initially simply because HighMem was common and that needs to be treated as a corner case -- we cannot safely migrate pages from zone normal to highmem. That one is relatively trivial to measure as it's a functional issue. However, backing off compaction properly to maximise allocation success rates while minimising allocation latency and access latency needs a live workload that is representative. Trivial cases like the java workloads, nas or usemem won't do as they either exhibit special locality or are streaming readers/writers. Memcache might work but the driver in that case is critical to ensure the access penalties are incurred. Again, a modern example is missing. As for why this took so long to discover, it is highly likely that it's due to VM's being sized such as they typically fit in a NUMA node so it would have avoided the worst case scenarios. Furthermore, a machine dedicated to VM's has fewer concerns with respect to slab allocations and unmovable allocations fragmenting memory long-term. Finally, the worst case scenarios are encountered when there is a mix of different workloads of variable duration which may be common in a Google-like setup with different jobs being dispatched across a large network but less so in other setups where a service tends to be persistent. We already know that some of the worst performance problems take years to discover.
On Mon, 15 Oct 2018, Andrea Arcangeli wrote: > > At the risk of beating a dead horse that has already been beaten, what are > > the plans for this patch when the merge window opens? It would be rather > > unfortunate for us to start incurring a 14% increase in access latency and > > 40% increase in fault latency. Would it be possible to test with my > > patch[*] that does not try reclaim to address the thrashing issue? If > > that is satisfactory, I don't have a strong preference if it is done with > > a hardcoded pageblock_order and __GFP_NORETRY check or a new > > __GFP_COMPACT_ONLY flag. > > I don't like the pageblock size hardcoding inside the page > allocator. __GFP_COMPACT_ONLY is fully runtime equivalent, but it at > least let the caller choose the behavior, so it looks more flexible. > I'm not sure that I understand why the user would ever want to thrash their zone(s) for allocations of this order. The problem here is specifically related to an entire pageblock becoming freeable and the unlikeliness that reclaiming/swapping/thrashing will assist memory compaction in making that happen. For this reason, I think the order >= pageblock_order check is reasonable because it depends on the implementation of memory compaction. Why do we need another gfp flag for thp allocations when they are made to be __GFP_NORETRY by default and it is very unlikely that reclaiming once and then retrying compaction is going to make an entire pageblock free? I'd like to know (1) how continuous reclaim activity can make entire pageblocks freeable without thrashing and (2) the metrics that we can use to determine when it is worthwhile vs harmful. I don't believe (1) is ever helpful based on the implementation of memory compaction and we lack (2) since reclaim is not targeted to memory that compaction can use. > As long as compaction returns COMPACT_SKIPPED it's ok to keep doing > reclaim and keep doing compaction, as long as compaction succeeds. > Compaction will operate on 32 pages at a time and declare success each time and then pick up where it left off the next time it is called in the hope that it "succeeds" 512/32=16 times in a row while constantly reclaiming memory. Even a single slab page in that pageblock will make all of this work useless. Reclaimed memory not accessible by the freeing scanner will make its work useless. We lack the ability to determine when compaction is successful in freeing a full pageblock.
On Mon, 15 Oct 2018, Andrea Arcangeli wrote: > > On Mon, 15 Oct 2018 15:30:17 -0700 (PDT) David Rientjes <rientjes@google.com> wrote: > > > Would it be possible to test with my > > > patch[*] that does not try reclaim to address the thrashing issue? > > > > Yes please. > > It'd also be great if a testcase reproducing the 40% higher access > latency (with the one liner original fix) was available. > I never said 40% higher access latency, I said 40% higher fault latency. The higher access latency is 13.9% as measured on Haswell. The test case is rather trivial: fragment all memory with order-4 memory to replicate a fragmented local zone, use sched_setaffinity() to bind to that node, and fault a reasonable number of hugepages (128MB, 256, whatever). The cost of faulting remotely in this case was measured to be 40% higher than falling back to local small pages. This occurs quite obviously because you are thrashing the remote node trying to allocate thp. > We don't have a testcase for David's 40% latency increase problem, but > that's likely to only happen when the system is somewhat low on memory > globally. Well, yes, but that's most of our systems. We can't keep around gigabytes of memory free just to work around this patch. Removing __GFP_THISNODE to avoid thrashing the local node obviously will incur a substantial performance degradation if you thrash the remote node as well. This should be rather straight forward. > When there's 75% or more of the RAM free (not even allocated as easily > reclaimable pagecache) globally, you don't expect to hit heavy > swapping. > I agree there is no regression introduced by your patch when 75% of memory is free. > The 40% THP allocation latency increase if you use MADV_HUGEPAGE in > such window where all remote zones are fully fragmented is somehow > lesser of a concern in my view (plus there's the compact deferred > logic that should mitigate that scenario). Furthermore it is only a > concern for page faults in MADV_HUGEPAGE ranges. If MADV_HUGEPAGE is > set the userland allocation is long lived, so such higher allocation > latency won't risk to hit short lived allocations that don't set > MADV_HUGEPAGE (unless madvise=always, but that's not the default > precisely because not all allocations are long lived). > > If the MADV_HUGEPAGE using library was freely available it'd also be > nice. > You scan your mappings for .text segments, map a hugepage-aligned region sufficient in size, mremap() to that region, and do MADV_HUGEPAGE.
On Tue, 16 Oct 2018, Mel Gorman wrote: > I consider this to be an unfortunate outcome. On the one hand, we have a > problem that three people can trivially reproduce with known test cases > and a patch shown to resolve the problem. Two of those three people work > on distributions that are exposed to a large number of users. On the > other, we have a problem that requires the system to be in a specific > state and an unknown workload that suffers badly from the remote access > penalties with a patch that has review concerns and has not been proven > to resolve the trivial cases. The specific state is that remote memory is fragmented as well, this is not atypical. Removing __GFP_THISNODE to avoid thrashing a zone will only be beneficial when you can allocate remotely instead. When you cannot allocate remotely instead, you've made the problem much worse for something that should be __GFP_NORETRY in the first place (and was for years) and should never thrash. I'm not interested in patches that require remote nodes to have an abundance of free or unfragmented memory to avoid regressing. > In the case of distributions, the first > patch addresses concerns with a common workload where on the other hand > we have an internal workload of a single company that is affected -- > which indirectly affects many users admittedly but only one entity directly. > The alternative, which is my patch, hasn't been tested or shown why it cannot work. We continue to talk about order >= pageblock_order vs __GFP_COMPACTONLY. I'd like to know, specifically: - what measurable affect my patch has that is better solved with removing __GFP_THISNODE on systems where remote memory is also fragmented? - what platforms benefit from remote access to hugepages vs accessing local small pages (I've asked this maybe 4 or 5 times now)? - how is reclaiming (and possibly thrashing) memory helpful if compaction fails to free an entire pageblock due to slab fragmentation due to low on memory conditions and the page allocator preference to return node- local memory? - how is reclaiming (and possibly thrashing) memory helpful if compaction cannot access the memory reclaimed because the freeing scanner has already passed by it, or the migration scanner has passed by it, since this reclaim is not targeted to pages it can find? - what metrics can be introduced to the page allocator so that we can determine that reclaiming (and possibly thrashing) memory will result in a hugepage being allocated? Until we have answers, especially for the last, there is no reason why thp allocations should not be __GFP_NORETRY including for MADV_HUGEPAGE regions. The implementation of memory compaction simply cannot guarantee that the cost is worthwhile.
Hi David, On 22 Oct 2018, at 17:04, David Rientjes wrote: > On Tue, 16 Oct 2018, Mel Gorman wrote: > >> I consider this to be an unfortunate outcome. On the one hand, we >> have a >> problem that three people can trivially reproduce with known test >> cases >> and a patch shown to resolve the problem. Two of those three people >> work >> on distributions that are exposed to a large number of users. On the >> other, we have a problem that requires the system to be in a specific >> state and an unknown workload that suffers badly from the remote >> access >> penalties with a patch that has review concerns and has not been >> proven >> to resolve the trivial cases. > > The specific state is that remote memory is fragmented as well, this > is > not atypical. Removing __GFP_THISNODE to avoid thrashing a zone will > only > be beneficial when you can allocate remotely instead. When you cannot > allocate remotely instead, you've made the problem much worse for > something that should be __GFP_NORETRY in the first place (and was for > years) and should never thrash. > > I'm not interested in patches that require remote nodes to have an > abundance of free or unfragmented memory to avoid regressing. I just wonder what is the page allocation priority list in your environment, assuming all memory nodes are so fragmented that no huge pages can be obtained without compaction or reclaim. Here is my version of that list, please let me know if it makes sense to you: 1. local huge pages: with compaction and/or page reclaim, you are willing to pay the penalty of getting huge pages; 2. local base pages: since, in your system, remote data accesses have much higher penalty than the extra TLB misses incurred by the base page size; 3. remote huge pages: at least it is better than remote base pages; 4. remote base pages: it performs worst in terms of locality and TLBs. This might not be easy to implement in current kernel, because the zones from remote nodes will always be candidates when kernel is trying get_page_from_freelist(). Only _GFP_THISNODE and MPOL_BIND can eliminate these remote node zones, where _GFP_THISNODE is a kernel version MPOL_BIND and overwrites any user space memory policy other than MPOL_BIND, which is troublesome. In addition, to prioritize local base pages over remote pages, the original huge page allocation has to fail, then kernel can fall back to base page allocations. And you will never get remote huge pages any more if the local base page allocation fails, because there is no way back to huge page allocation after the fallback. Do you expect both behaviors? >> In the case of distributions, the first >> patch addresses concerns with a common workload where on the other >> hand >> we have an internal workload of a single company that is affected -- >> which indirectly affects many users admittedly but only one entity >> directly. >> > > The alternative, which is my patch, hasn't been tested or shown why it > cannot work. We continue to talk about order >= pageblock_order vs > __GFP_COMPACTONLY. > > I'd like to know, specifically: > > - what measurable affect my patch has that is better solved with > removing > __GFP_THISNODE on systems where remote memory is also fragmented? > > - what platforms benefit from remote access to hugepages vs accessing > local small pages (I've asked this maybe 4 or 5 times now)? > > - how is reclaiming (and possibly thrashing) memory helpful if > compaction > fails to free an entire pageblock due to slab fragmentation due to > low > on memory conditions and the page allocator preference to return > node- > local memory? > > - how is reclaiming (and possibly thrashing) memory helpful if > compaction > cannot access the memory reclaimed because the freeing scanner has > already passed by it, or the migration scanner has passed by it, > since > this reclaim is not targeted to pages it can find? > > - what metrics can be introduced to the page allocator so that we can > determine that reclaiming (and possibly thrashing) memory will > result > in a hugepage being allocated? The slab fragmentation and whether reclaim/compaction can help form huge pages seem to orthogonal to this patch, which tries to decide the priority between locality and huge pages. For slab fragmentation, you might find this paper “Making Huge Pages Actually Useful” (https://dl.acm.org/citation.cfm?id=3173203) helpful. The paper is trying to minimize the number of page blocks that have both moveable and non-moveable pages. -- Best Regards Yan Zi
On Mon, Oct 22, 2018 at 02:04:32PM -0700, David Rientjes wrote: > On Tue, 16 Oct 2018, Mel Gorman wrote: > > > I consider this to be an unfortunate outcome. On the one hand, we have a > > problem that three people can trivially reproduce with known test cases > > and a patch shown to resolve the problem. Two of those three people work > > on distributions that are exposed to a large number of users. On the > > other, we have a problem that requires the system to be in a specific > > state and an unknown workload that suffers badly from the remote access > > penalties with a patch that has review concerns and has not been proven > > to resolve the trivial cases. > > The specific state is that remote memory is fragmented as well, this is > not atypical. While not necessarily atypical, *how* it gets fragmented is important. The final state of fragmentation depends on both the input allocation stream *and* the liveness of the mobility of unmovable pages. This is why a target workload is crucial. While it's trivial to fragment memory, it can be in a state that may or may not be trivially compacted. For example, a fragmenting workload that interleaves slab allocations with page cache may fragment memory but if the files are not being used at the time of a THP allocation then they are trivially reclaimed/compacted. However, if the files are in active use, the system remains active. You make the following claim in another response The test case is rather trivial: fragment all memory with order-4 memory to replicate a fragmented local zone, use sched_setaffinity() to bind to that node, and fault a reasonable number of hugepages (128MB, 256, whatever). You do not describe why it's order-4 allocations that are fragmenting or why they remain live. order-4 might imply large jumbo frame allocations but they typically do not remain live for very long unless you are using a driver that recycles large numbers of order-4 pages. It's a critical detail. If your test case is trivial then post it so there is a common base to work from. A test case has been requested from you multiple times already. Note that I accept it's trivial to fragment memory in a harmful way. I've prototyped a test case yesterday that uses fio in the following way to fragment memory o fio of many small files (64K) o create initial pages using writes that disable fallocate and create inodes on first open. This is massively inefficient from an IO perspective but it mixes slab and page cache allocations so all NUMA nodes get fragmented. o Size the page cache so that it's 150% the size of memory so it forces reclaim activity and new fio activity to further mix slab and page cache allocations o After initial write, run parallel readers to keep slab active and run this for the same length of time the initial writes took so fio has called stat() on the existing files and begun the read phase. This forces the slab and page cache pages to remain "live" and difficult to reclaim/compact. o Finally, start a workload that allocates THP after the warmup phase but while fio is still runnning to measure allocation success rate and latencies There are three configurations that use default advice, MADV_HUGEPAGE and "always" fragment that is still running to cover the differet configurations of interest. However, this is completly useless to you as even if this test cases are fixed, there is no guarantee at all that it helps yours. They are still being evaluated as they were recently prototyped but the mmtests configurations in case someone wishes to independently evaluate are; config-global-dhp__workload_thpfioscale config-global-dhp__workload_thpfioscale-defrag config-global-dhp__workload_thpfioscale-madvhugepage It's best to run them on a dedicated test partition if possible. Locally I've configured them to use a freshly created XFS filesystem. If you're going to block fixes for problems that multiple people experience then at least do the courtesy of providing a test case or prototype patches for the complex alternatives you are proposing so they can be independently evaluated. > Removing __GFP_THISNODE to avoid thrashing a zone will only > be beneficial when you can allocate remotely instead. When you cannot > allocate remotely instead, you've made the problem much worse for > something that should be __GFP_NORETRY in the first place (and was for > years) and should never thrash. > > I'm not interested in patches that require remote nodes to have an > abundance of free or unfragmented memory to avoid regressing. > If only there was a test case that could reliably demonstrate this independenly so it can be analysed and fixed *hint hint*. Better yet, kindly prototype a fix. > > In the case of distributions, the first > > patch addresses concerns with a common workload where on the other hand > > we have an internal workload of a single company that is affected -- > > which indirectly affects many users admittedly but only one entity directly. > > > > The alternative, which is my patch, hasn't been tested or shown why it > cannot work. We continue to talk about order >= pageblock_order vs > __GFP_COMPACTONLY. > You already received negative feedback on it. I worried that it would conflate the requirements of THP and hugetlbfs with the latter usually being more willing to tolerate initial latency to get the huge pages. It's also being pointed out numerous times that the bug being addressed is for a trivial workload on a clean system and not a case that depends heavily on the system state. Normally the priority is to fix the trivial case and work on the complex case. Furthermore, Andrea has already stated that the complex alternatives will take too long as they are non-trivial changes and that upstrewam cannot be shipped (for RHEL presumably) with the existing bug. The bug being addressed, and has a confirmed patch for, is addressing a severe regression that "made important workloads unusuable". We are now in a situation where distributions are likely to carry an out-of-tree patch to cover the trivial case with a limited path forward as the blocking test case is unuavailable. All that said, your patches results when tested were inconclusive at best. For the trivial case, system CPU usage is indeed reduced to similar levels to Michal's patch. However, the actual time to complete is 15.7% longer than the vanilla kernel. Michal's patch completed 10.5% faster. > I'd like to know, specifically: > > - what measurable affect my patch has that is better solved with removing > __GFP_THISNODE on systems where remote memory is also fragmented? > In my case, it failed to fix the trivial case. That may be specific to my machine or bad luck but nevertheless, it failed. The patch is also at the RFC level with no test case that can be independently verified. > - what platforms benefit from remote access to hugepages vs accessing > local small pages (I've asked this maybe 4 or 5 times now)? > That's a loaded question. In the case of a workload that has long active phases that fit in L3 cache, the remote access penalty is masked. Furthermore, virtualisation can benefit from THP even if remote due to the reduced depth of page table walks combined with the fact that vcpus may migrate to nodes accessing local memory due to either normal scheduler waker/wakee migrations or automatic NUMA balancing. > - how is reclaiming (and possibly thrashing) memory helpful if compaction > fails to free an entire pageblock due to slab fragmentation due to low > on memory conditions and the page allocator preference to return node- > local memory? > It is not, but it requires knowledge of the future to know if reclaim/compaction will be truely beneficial. While it could be addressed with tracking state and heuristics, no such state or heuristic has been proposed. Furthermore, it's ortotogonal to the patch under discussion and this is somewhat the crux of the matter. You are naking a fix for a known trivial problem and proposing that people work on an unspecified, unbounded, long-lived project as the basis for the nak without providing even a prototype of what you propose. > - how is reclaiming (and possibly thrashing) memory helpful if compaction > cannot access the memory reclaimed because the freeing scanner has > already passed by it, or the migration scanner has passed by it, since > this reclaim is not targeted to pages it can find? > Again, knowledge of the future or a lot of memory scanning combined with heuristics may be required to achieve what you propose. > - what metrics can be introduced to the page allocator so that we can > determine that reclaiming (and possibly thrashing) memory will result > in a hugepage being allocated? > Knowledge of the future to know as we cannot know in advance if reclaimn will succeed without knowing if a workload is keeping contents of slab and the LRU active. > Until we have answers, especially for the last, there is no reason why thp > allocations should not be __GFP_NORETRY including for MADV_HUGEPAGE > regions. The implementation of memory compaction simply cannot guarantee > that the cost is worthwhile. Other than the fact that the __GFP_NORETRY patch failed at least one test that is without a clear indication and changelog showing that it addresses the problems with an unreleased workload.
On Tue, Oct 23, 2018 at 08:57:45AM +0100, Mel Gorman wrote: > Note that I accept it's trivial to fragment memory in a harmful way. > I've prototyped a test case yesterday that uses fio in the following way > to fragment memory > > o fio of many small files (64K) > o create initial pages using writes that disable fallocate and create > inodes on first open. This is massively inefficient from an IO > perspective but it mixes slab and page cache allocations so all > NUMA nodes get fragmented. > o Size the page cache so that it's 150% the size of memory so it forces > reclaim activity and new fio activity to further mix slab and page > cache allocations > o After initial write, run parallel readers to keep slab active and run > this for the same length of time the initial writes took so fio has > called stat() on the existing files and begun the read phase. This > forces the slab and page cache pages to remain "live" and difficult > to reclaim/compact. > o Finally, start a workload that allocates THP after the warmup phase > but while fio is still runnning to measure allocation success rate > and latencies > The tests completed shortly after I wrote this mail so I can put some figures to the intuitions expressed in this mail. I'm truncating the reports for clarity but can upload the full data if necessary. The target system is a 2-socket using E5-2670 v3 (Haswell). Base kernel is 4.19. The baseline is an unpatched kernel. relaxthisnode-v1r1 is patch 1 of Michal's series and does not include the second cleanup. noretry-v1r1 is David's alternative global-dhp__workload_usemem-stress-numa-compact (no filesystem as this is the trivial case of allocating anonymous memory on a freshly booted system. Figures are elapsed time) 4.19.0 4.19.0 4.19.0 vanilla relaxthisnode-v1r1 noretry-v1r1 Amean System-1 14.16 ( 0.00%) 12.35 * 12.75%* 15.96 * -12.70%* Amean System-3 15.14 ( 0.00%) 9.83 * 35.08%* 11.00 * 27.34%* Amean System-4 9.88 ( 0.00%) 9.85 ( 0.25%) 9.80 ( 0.75%) Amean Elapsd-1 29.23 ( 0.00%) 26.16 * 10.50%* 33.81 * -15.70%* Amean Elapsd-3 25.67 ( 0.00%) 7.28 * 71.63%* 8.49 * 66.93%* Amean Elapsd-4 5.49 ( 0.00%) 5.53 ( -0.76%) 5.46 ( 0.49%) The figures in () are the percentage gain/loss. If it's around *'s then the automation has guessed at the results are outside the noise. System CPU usage is reduced by both as reported but Micha's gives a 10.5% gain and David's is a 15.7% loss. Boith appear to be outside the noise. While not included here, the vanilla kernel swaps heavily with a 56% reclaim efficiency (pages scanned vs pages reclaimed) and neither of the proposed patches swaps and it's all from direct reclaim activity. Michal's patch does not enter reclaim, David's enters reclaim but it's very light. global-dhp__workload_thpfioscale-xfs (Uses fio to fragment memory and keep slab and page cache active while there is an attempt to allocate THP in parallel. No special madvise flags or tuning is applied. A dedicated test partition is used for fio and XFS was the target filesystem that is recreated on every test) thpfioscale Fault Latencies 4.19.0 4.19.0 4.19.0 vanilla relaxthisnode-v1r1 noretry-v1r1 Amean fault-base-5 1471.95 ( 0.00%) 1515.64 ( -2.97%) 1491.05 ( -1.30%) Amean fault-huge-5 0.00 ( 0.00%) 534.51 * -99.00%* 0.00 ( 0.00%) thpfioscale Percentage Faults Huge 4.19.0 4.19.0 4.19.0 vanilla relaxthisnode-v1r1 noretry-v1r1 Percentage huge-5 0.00 ( 0.00%) 1.18 ( 100.00%) 0.00 ( 0.00%) Both patches incur a slight hit to fault latency (measured in microseconds) but it's well within the noise. While not included here, the variance is massive (min 1052 microseconds, max 282348 microseconds in the vanilla kernel. Both patches reduce the worst-case scenarios. All kernels show terrible allocation success rates. Michal's had a 1.18% success rate but that's probably luck. global-dhp__workload_thpfioscale-madvhugepage-xfs (Same as the last test but the THP allocation program uses MADV_HUGEPAGE) thpfioscale Fault Latencies 4.19.0 4.19.0 4.19.0 vanilla relaxthisnode-v1r1 noretry-v1r1 Amean fault-base-5 6772.84 ( 0.00%) 10256.30 * -51.43%* 1574.45 * 76.75%* Amean fault-huge-5 2644.19 ( 0.00%) 5314.17 *-100.98%* 3517.89 ( -33.04%) thpfioscale Percentage Faults Huge 4.19.0 4.19.0 4.19.0 vanilla relaxthisnode-v1r1 noretry-v1r1 Percentage huge-5 45.48 ( 0.00%) 95.09 ( 109.08%) 2.81 ( -93.81% The first point of interest is that even with the vanilla kernel, the allocation fault latency is much higher than average reflecting that additional work is being done. Next point of interest -- David's patch has much lower latency on average when allocating *base* pages showing and the vmstats (not included) show that compaction activity is reduced but not eliminated. To balance this, Michal's patch has an 95% allocation success rate for THP versus 45% on the default kernel at the cost of higher fault latency. This is almost certainly a reflection that THPs are being allocated on remote nodes. This can be considered good or bad depending on whether THP is more important than locality. Note with David's patch that the allocation success rate drops to 2.81% showing that it's much less efficient at THP. This demonstrates a very clear trade-off between allocation latency and allocation success rate for THP. Which one is better is workload dependent. global-dhp__workload_thpfioscale-defrag-xfs (Same as global-dhp__workload_thpfioscale-xfs except that defrag is set to always) thpfioscale Fault Latencies 4.19.0 4.19.0 4.19.0 vanilla relaxthisnode-v1r1 noretry-v1r1 Amean fault-base-5 2678.60 ( 0.00%) 4442.14 * -65.84%* 1640.15 * 38.77%* Amean fault-huge-5 1324.61 ( 0.00%) 1460.08 ( -10.23%) 2358.23 ( -78.03%) thpfioscale Percentage Faults Huge 4.19.0 4.19.0 4.19.0 vanilla relaxthisnode-v1r1 noretry-v1r1 Percentage huge-5 0.90 ( 0.00%) 0.40 ( -55.56%) 0.22 ( -75.93%) The allocation latency is again higher in this case as greater effort is made to allocate the huge page. Michal's takes a hit as it's still trying to allocate the THP while David's gives up early. In all cases the allocation success rate is terrible. So it should be reasonably clear that no approach is a universal win. Michal's wins at the trivial case which is what the original problem was and why it was pushed at all. David's in general has lower latency in general because it gives up quickly but the allocation success rate when MADV_HUGEPAGE specifically asks for huge pages is terrible. This may make it a non-starter for the virtualisation case that wants huge pages on the basis that if an application asks for huge pages, it presumably is willing to pay the cost to get them.
On Mon, 22 Oct 2018, Zi Yan wrote: > Hi David, > Hi! > On 22 Oct 2018, at 17:04, David Rientjes wrote: > > > On Tue, 16 Oct 2018, Mel Gorman wrote: > > > > > I consider this to be an unfortunate outcome. On the one hand, we have a > > > problem that three people can trivially reproduce with known test cases > > > and a patch shown to resolve the problem. Two of those three people work > > > on distributions that are exposed to a large number of users. On the > > > other, we have a problem that requires the system to be in a specific > > > state and an unknown workload that suffers badly from the remote access > > > penalties with a patch that has review concerns and has not been proven > > > to resolve the trivial cases. > > > > The specific state is that remote memory is fragmented as well, this is > > not atypical. Removing __GFP_THISNODE to avoid thrashing a zone will only > > be beneficial when you can allocate remotely instead. When you cannot > > allocate remotely instead, you've made the problem much worse for > > something that should be __GFP_NORETRY in the first place (and was for > > years) and should never thrash. > > > > I'm not interested in patches that require remote nodes to have an > > abundance of free or unfragmented memory to avoid regressing. > > I just wonder what is the page allocation priority list in your environment, > assuming all memory nodes are so fragmented that no huge pages can be > obtained without compaction or reclaim. > > Here is my version of that list, please let me know if it makes sense to you: > > 1. local huge pages: with compaction and/or page reclaim, you are willing > to pay the penalty of getting huge pages; > > 2. local base pages: since, in your system, remote data accesses have much > higher penalty than the extra TLB misses incurred by the base page size; > > 3. remote huge pages: at least it is better than remote base pages; > > 4. remote base pages: it performs worst in terms of locality and TLBs. > I have a ton of different platforms available. Consider a very basic access latency evaluation on Broadwell on a running production system: remote hugepage vs remote PAGE_SIZE pages had about 5% better access latency. Remote PAGE_SIZE pages vs local pages is a 12% degradation. On Naples, remote hugepage vs remote PAGE_SIZE had 2% better access latency intrasocket, no better access latency intersocket. Remote PAGE_SIZE pages vs local is a 16% degradation intrasocket and 38% degradation intersocket. My list removes (3) from your list, but is otherwise unchanged. I remove (3) because 2-5% better access latency is nice, but we'd much rather fault local base pages and then let khugepaged collapse it into a local hugepage when fragmentation is improved or we have freed memory. That is where we can get a much better result, 41% better access latency on Broadwell and 52% better access latncy on Naples. I wouldn't trade that for 2-5% immediate remote hugepages. It just so happens that prior to this patch, the implementation of the page allocator matches this preference. > In addition, to prioritize local base pages over remote pages, > the original huge page allocation has to fail, then kernel can > fall back to base page allocations. And you will never get remote > huge pages any more if the local base page allocation fails, > because there is no way back to huge page allocation after the fallback. > That is exactly what we want, we want khugepaged to collapse memory into local hugepages for the big improvement rather than persistently access a hugepage remotely; the win of the remote hugepage just isn't substantial enough, and the win of the local hugepage is just too great. > > I'd like to know, specifically: > > > > - what measurable affect my patch has that is better solved with removing > > __GFP_THISNODE on systems where remote memory is also fragmented? > > > > - what platforms benefit from remote access to hugepages vs accessing > > local small pages (I've asked this maybe 4 or 5 times now)? > > > > - how is reclaiming (and possibly thrashing) memory helpful if compaction > > fails to free an entire pageblock due to slab fragmentation due to low > > on memory conditions and the page allocator preference to return node- > > local memory? > > > > - how is reclaiming (and possibly thrashing) memory helpful if compaction > > cannot access the memory reclaimed because the freeing scanner has > > already passed by it, or the migration scanner has passed by it, since > > this reclaim is not targeted to pages it can find? > > > > - what metrics can be introduced to the page allocator so that we can > > determine that reclaiming (and possibly thrashing) memory will result > > in a hugepage being allocated? > > The slab fragmentation and whether reclaim/compaction can help form > huge pages seem to orthogonal to this patch, which tries to decide > the priority between locality and huge pages. > It's not orthogonal to the problem being reported which requires local memory pressure. If there is no memory pressure, compaction often can succeed without reclaim because the freeing scanner can find target memory and the migration scanner can make a pageblock free. Under memory pressure, however, where Andrea is experiencing the thrashing of the local node, by this time it can be inferred that slab pages have already fallen bcak to MIGRATE_MOVABLE pageblocks. There is nothing preventing it under memory pressure because of the preference to return local memory over fragmenting pageblocks. So the point of slab fragmentation, which typically exists locally when there is memory pressure, is that we cannot ascertain whether memory compaction even with reclaim will be successful. Not only because the freeing scanner cannot access reclaimed memory, but also because we have no feedback from compaction to determine whether the work will be useful. Thrashing the local node, migrating COMPACT_CLUSTER_MAX pages, finding one slab page sitting in the pageblock, and continuing is not a good use of the allocator's time. This is true of both MADV_HUGEPAGE and non-MADV_HUGEPAGE regions. For reclaim to be considered, we should ensure that work is useful to compaction. That ability is non-existant. The worst case scenario is you thrash the local node and still cannot allocate a hugepage.
On Tue, Sep 25, 2018 at 02:03:25PM +0200, Michal Hocko wrote: > From: Andrea Arcangeli <aarcange@redhat.com> > > THP allocation might be really disruptive when allocated on NUMA system > with the local node full or hard to reclaim. Stefan has posted an > allocation stall report on 4.12 based SLES kernel which suggests the > same issue: > > [245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null) > [245513.363983] kvm cpuset=/ mems_allowed=0-1 > [245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased) > [245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017 > [245513.365905] Call Trace: > [245513.366535] dump_stack+0x5c/0x84 > [245513.367148] warn_alloc+0xe0/0x180 > [245513.367769] __alloc_pages_slowpath+0x820/0xc90 > [245513.368406] ? __slab_free+0xa9/0x2f0 > [245513.369048] ? __slab_free+0xa9/0x2f0 > [245513.369671] __alloc_pages_nodemask+0x1cc/0x210 > [245513.370300] alloc_pages_vma+0x1e5/0x280 > [245513.370921] do_huge_pmd_wp_page+0x83f/0xf00 > [245513.371554] ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0 > [245513.372184] ? do_huge_pmd_anonymous_page+0x631/0x6d0 > [245513.372812] __handle_mm_fault+0x93d/0x1060 > [245513.373439] handle_mm_fault+0xc6/0x1b0 > [245513.374042] __do_page_fault+0x230/0x430 > [245513.374679] ? get_vtime_delta+0x13/0xb0 > [245513.375411] do_page_fault+0x2a/0x70 > [245513.376145] ? page_fault+0x65/0x80 > [245513.376882] page_fault+0x7b/0x80 > [...] > [245513.382056] Mem-Info: > [245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5 > active_file:60183 inactive_file:245285 isolated_file:0 > unevictable:15657 dirty:286 writeback:1 unstable:0 > slab_reclaimable:75543 slab_unreclaimable:2509111 > mapped:81814 shmem:31764 pagetables:370616 bounce:0 > free:32294031 free_pcp:6233 free_cma:0 > [245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no > [245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no > > The defrag mode is "madvise" and from the above report it is clear that > the THP has been allocated for MADV_HUGEPAGA vma. > > Andrea has identified that the main source of the problem is > __GFP_THISNODE usage: > > : The problem is that direct compaction combined with the NUMA > : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very > : hard the local node, instead of failing the allocation if there's no > : THP available in the local node. > : > : Such logic was ok until __GFP_THISNODE was added to the THP allocation > : path even with MPOL_DEFAULT. > : > : The idea behind the __GFP_THISNODE addition, is that it is better to > : provide local memory in PAGE_SIZE units than to use remote NUMA THP > : backed memory. That largely depends on the remote latency though, on > : threadrippers for example the overhead is relatively low in my > : experience. > : > : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in > : extremely slow qemu startup with vfio, if the VM is larger than the > : size of one host NUMA node. This is because it will try very hard to > : unsuccessfully swapout get_user_pages pinned pages as result of the > : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE > : allocations and instead of trying to allocate THP on other nodes (it > : would be even worse without vfio type1 GUP pins of course, except it'd > : be swapping heavily instead). > > Fix this by removing __GFP_THISNODE for THP requests which are > requesting the direct reclaim. This effectivelly reverts 5265047ac301 on > the grounds that the zone/node reclaim was known to be disruptive due > to premature reclaim when there was memory free. While it made sense at > the time for HPC workloads without NUMA awareness on rare machines, it > was ultimately harmful in the majority of cases. The existing behaviour > is similiar, if not as widespare as it applies to a corner case but > crucially, it cannot be tuned around like zone_reclaim_mode can. The > default behaviour should always be to cause the least harm for the > common case. > > If there are specialised use cases out there that want zone_reclaim_mode > in specific cases, then it can be built on top. Longterm we should > consider a memory policy which allows for the node reclaim like behavior > for the specific memory ranges which would allow a > > [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com > I think we have a similar problem elsewhere too I've run into cases where alloc_pool_huge_page() took forever looping in reclaim via compaction_test. My tests and tracing eventually showed that the root cause was we were looping in should_continue_reclaim() due to __GFP_RETRY_MAYFAIL (set in alloc_fresh_huge_page()). The scanned value was much lesser than sc->order. I have a small RFC patch that I am testing and it seems good to so far, having said that the issue is hard to reproduce and takes a while to hit. I wonder if alloc_pool_huge_page() should also trim out it's logic of __GFP_THISNODE for the same reasons as mentioned here. I like that we round robin to alloc the pool pages, but __GFP_THISNODE might be an overkill for that case as well. Balbir Singh.
On Mon 29-10-18 16:17:52, Balbir Singh wrote: [...] > I wonder if alloc_pool_huge_page() should also trim out it's logic > of __GFP_THISNODE for the same reasons as mentioned here. I like > that we round robin to alloc the pool pages, but __GFP_THISNODE > might be an overkill for that case as well. alloc_pool_huge_page uses __GFP_THISNODE for a different reason than THP. We really do want to allocated for a per-node pool. THP can fallback or use a different node. These hugetlb allocations might be disruptive and that is an expected behavior because this is an explicit requirement from an admin to pre-allocate large pages for the future use. __GFP_RETRY_MAYFAIL just underlines that requirement. Maybe the compaction logic could be improved and that might be a shared goal with future changes though.
On Mon, Oct 29, 2018 at 10:00:35AM +0100, Michal Hocko wrote: > On Mon 29-10-18 16:17:52, Balbir Singh wrote: > [...] > > I wonder if alloc_pool_huge_page() should also trim out it's logic > > of __GFP_THISNODE for the same reasons as mentioned here. I like > > that we round robin to alloc the pool pages, but __GFP_THISNODE > > might be an overkill for that case as well. > > alloc_pool_huge_page uses __GFP_THISNODE for a different reason than > THP. We really do want to allocated for a per-node pool. THP can > fallback or use a different node. > Not really static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed) ... gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE; ... for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) { page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed); if (page) break; } The code just tries to distribute the pool > These hugetlb allocations might be disruptive and that is an expected > behavior because this is an explicit requirement from an admin to > pre-allocate large pages for the future use. __GFP_RETRY_MAYFAIL just > underlines that requirement. Yes, but in the absence of a particular node, for example via sysctl (as the compaction does), I don't think it is a hard requirement to get a page from a particular node. I agree we need __GFP_RETRY_FAIL, in any case the real root cause for me is should_reclaim_continue() which keeps the task looping without making forward progress. The __GFP_THISNODE was again an example of mis-leading the allocator in this case, IMHO. > > Maybe the compaction logic could be improved and that might be a shared > goal with future changes though. I'll also send my RFC once my testing is done, assuming I get it to reproduce with a desired frequency. Balbir Singh.
On Mon 29-10-18 20:42:53, Balbir Singh wrote: > On Mon, Oct 29, 2018 at 10:00:35AM +0100, Michal Hocko wrote: [...] > > These hugetlb allocations might be disruptive and that is an expected > > behavior because this is an explicit requirement from an admin to > > pre-allocate large pages for the future use. __GFP_RETRY_MAYFAIL just > > underlines that requirement. > > Yes, but in the absence of a particular node, for example via sysctl > (as the compaction does), I don't think it is a hard requirement to get > a page from a particular node. Again this seems like a deliberate decision. You want your distributions as even as possible otherwise the NUMA placement will be much less deterministic. At least that was the case for a long time. If you have different per-node preferences, just use NUMA aware pre-allocation. > I agree we need __GFP_RETRY_FAIL, in any > case the real root cause for me is should_reclaim_continue() which keeps > the task looping without making forward progress. This seems like a separate issue which should better be debugged. Please open a new thread describing the problem and the state of the node.
Hello, On Mon, Oct 29, 2018 at 11:08:34AM +0100, Michal Hocko wrote: > This seems like a separate issue which should better be debugged. Please > open a new thread describing the problem and the state of the node. Yes, in my view it should be evaluated separately too, because it's overall less concerning: __GFP_THISNODE there can only be set by the root user there. So it has a chance to be legitimate behavior there. Let's focus on solving the __GFP_THISNODE that any user in the system can set (not only root) and cause severe and unexpected swap storms or slowdowns to all other processes run by other users. ls -l /sys/kernel/mm/hugepages/*/nr_hugepages (and boot command line) Thanks, Andrea
diff --git a/mm/mempolicy.c b/mm/mempolicy.c index da858f794eb6..149b6f4cf023 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -2046,8 +2046,36 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, nmask = policy_nodemask(gfp, pol); if (!nmask || node_isset(hpage_node, *nmask)) { mpol_cond_put(pol); - page = __alloc_pages_node(hpage_node, - gfp | __GFP_THISNODE, order); + /* + * We cannot invoke reclaim if __GFP_THISNODE + * is set. Invoking reclaim with + * __GFP_THISNODE set, would cause THP + * allocations to trigger heavy swapping + * despite there may be tons of free memory + * (including potentially plenty of THP + * already available in the buddy) on all the + * other NUMA nodes. + * + * At most we could invoke compaction when + * __GFP_THISNODE is set (but we would need to + * refrain from invoking reclaim even if + * compaction returned COMPACT_SKIPPED because + * there wasn't not enough memory to succeed + * compaction). For now just avoid + * __GFP_THISNODE instead of limiting the + * allocation path to a strict and single + * compaction invocation. + * + * Supposedly if direct reclaim was enabled by + * the caller, the app prefers THP regardless + * of the node it comes from so this would be + * more desiderable behavior than only + * providing THP originated from the local + * node in such case. + */ + if (!(gfp & __GFP_DIRECT_RECLAIM)) + gfp |= __GFP_THISNODE; + page = __alloc_pages_node(hpage_node, gfp, order); goto out; } }