diff mbox series

[1/2] mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings

Message ID 20180925120326.24392-2-mhocko@kernel.org (mailing list archive)
State New, archived
Headers show
Series thp nodereclaim fixes | expand

Commit Message

Michal Hocko Sept. 25, 2018, 12:03 p.m. UTC
From: Andrea Arcangeli <aarcange@redhat.com>

THP allocation might be really disruptive when allocated on NUMA system
with the local node full or hard to reclaim. Stefan has posted an
allocation stall report on 4.12 based SLES kernel which suggests the
same issue:

[245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
[245513.363983] kvm cpuset=/ mems_allowed=0-1
[245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased)
[245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
[245513.365905] Call Trace:
[245513.366535]  dump_stack+0x5c/0x84
[245513.367148]  warn_alloc+0xe0/0x180
[245513.367769]  __alloc_pages_slowpath+0x820/0xc90
[245513.368406]  ? __slab_free+0xa9/0x2f0
[245513.369048]  ? __slab_free+0xa9/0x2f0
[245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
[245513.370300]  alloc_pages_vma+0x1e5/0x280
[245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
[245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
[245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
[245513.372812]  __handle_mm_fault+0x93d/0x1060
[245513.373439]  handle_mm_fault+0xc6/0x1b0
[245513.374042]  __do_page_fault+0x230/0x430
[245513.374679]  ? get_vtime_delta+0x13/0xb0
[245513.375411]  do_page_fault+0x2a/0x70
[245513.376145]  ? page_fault+0x65/0x80
[245513.376882]  page_fault+0x7b/0x80
[...]
[245513.382056] Mem-Info:
[245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
                 active_file:60183 inactive_file:245285 isolated_file:0
                 unevictable:15657 dirty:286 writeback:1 unstable:0
                 slab_reclaimable:75543 slab_unreclaimable:2509111
                 mapped:81814 shmem:31764 pagetables:370616 bounce:0
                 free:32294031 free_pcp:6233 free_cma:0
[245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no

The defrag mode is "madvise" and from the above report it is clear that
the THP has been allocated for MADV_HUGEPAGA vma.

Andrea has identified that the main source of the problem is
__GFP_THISNODE usage:

: The problem is that direct compaction combined with the NUMA
: __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
: hard the local node, instead of failing the allocation if there's no
: THP available in the local node.
:
: Such logic was ok until __GFP_THISNODE was added to the THP allocation
: path even with MPOL_DEFAULT.
:
: The idea behind the __GFP_THISNODE addition, is that it is better to
: provide local memory in PAGE_SIZE units than to use remote NUMA THP
: backed memory. That largely depends on the remote latency though, on
: threadrippers for example the overhead is relatively low in my
: experience.
:
: The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
: extremely slow qemu startup with vfio, if the VM is larger than the
: size of one host NUMA node. This is because it will try very hard to
: unsuccessfully swapout get_user_pages pinned pages as result of the
: __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
: allocations and instead of trying to allocate THP on other nodes (it
: would be even worse without vfio type1 GUP pins of course, except it'd
: be swapping heavily instead).

Fix this by removing __GFP_THISNODE for THP requests which are
requesting the direct reclaim. This effectivelly reverts 5265047ac301 on
the grounds that the zone/node reclaim was known to be disruptive due
to premature reclaim when there was memory free. While it made sense at
the time for HPC workloads without NUMA awareness on rare machines, it
was ultimately harmful in the majority of cases. The existing behaviour
is similiar, if not as widespare as it applies to a corner case but
crucially, it cannot be tuned around like zone_reclaim_mode can. The
default behaviour should always be to cause the least harm for the
common case.

If there are specialised use cases out there that want zone_reclaim_mode
in specific cases, then it can be built on top. Longterm we should
consider a memory policy which allows for the node reclaim like behavior
for the specific memory ranges which would allow a

[1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com

[mhocko@suse.com: rewrote the changelog based on the one from Andrea]
Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: stable # 4.1+
Reported-by: Stefan Priebe <s.priebe@profihost.ag>
Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/mempolicy.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

Comments

Mel Gorman Sept. 25, 2018, 12:20 p.m. UTC | #1
On Tue, Sep 25, 2018 at 02:03:25PM +0200, Michal Hocko wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> THP allocation might be really disruptive when allocated on NUMA system
> with the local node full or hard to reclaim. Stefan has posted an
> allocation stall report on 4.12 based SLES kernel which suggests the
> same issue:
> 
> [245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
> [245513.363983] kvm cpuset=/ mems_allowed=0-1
> [245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased)
> [245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
> [245513.365905] Call Trace:
> [245513.366535]  dump_stack+0x5c/0x84
> [245513.367148]  warn_alloc+0xe0/0x180
> [245513.367769]  __alloc_pages_slowpath+0x820/0xc90
> [245513.368406]  ? __slab_free+0xa9/0x2f0
> [245513.369048]  ? __slab_free+0xa9/0x2f0
> [245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
> [245513.370300]  alloc_pages_vma+0x1e5/0x280
> [245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
> [245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
> [245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
> [245513.372812]  __handle_mm_fault+0x93d/0x1060
> [245513.373439]  handle_mm_fault+0xc6/0x1b0
> [245513.374042]  __do_page_fault+0x230/0x430
> [245513.374679]  ? get_vtime_delta+0x13/0xb0
> [245513.375411]  do_page_fault+0x2a/0x70
> [245513.376145]  ? page_fault+0x65/0x80
> [245513.376882]  page_fault+0x7b/0x80
> [...]
> [245513.382056] Mem-Info:
> [245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
>                  active_file:60183 inactive_file:245285 isolated_file:0
>                  unevictable:15657 dirty:286 writeback:1 unstable:0
>                  slab_reclaimable:75543 slab_unreclaimable:2509111
>                  mapped:81814 shmem:31764 pagetables:370616 bounce:0
>                  free:32294031 free_pcp:6233 free_cma:0
> [245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> [245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> 
> The defrag mode is "madvise" and from the above report it is clear that
> the THP has been allocated for MADV_HUGEPAGA vma.
> 
> Andrea has identified that the main source of the problem is
> __GFP_THISNODE usage:
> 
> : The problem is that direct compaction combined with the NUMA
> : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
> : hard the local node, instead of failing the allocation if there's no
> : THP available in the local node.
> :
> : Such logic was ok until __GFP_THISNODE was added to the THP allocation
> : path even with MPOL_DEFAULT.
> :
> : The idea behind the __GFP_THISNODE addition, is that it is better to
> : provide local memory in PAGE_SIZE units than to use remote NUMA THP
> : backed memory. That largely depends on the remote latency though, on
> : threadrippers for example the overhead is relatively low in my
> : experience.
> :
> : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
> : extremely slow qemu startup with vfio, if the VM is larger than the
> : size of one host NUMA node. This is because it will try very hard to
> : unsuccessfully swapout get_user_pages pinned pages as result of the
> : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
> : allocations and instead of trying to allocate THP on other nodes (it
> : would be even worse without vfio type1 GUP pins of course, except it'd
> : be swapping heavily instead).
> 
> Fix this by removing __GFP_THISNODE for THP requests which are
> requesting the direct reclaim. This effectivelly reverts 5265047ac301 on
> the grounds that the zone/node reclaim was known to be disruptive due
> to premature reclaim when there was memory free. While it made sense at
> the time for HPC workloads without NUMA awareness on rare machines, it
> was ultimately harmful in the majority of cases. The existing behaviour
> is similiar, if not as widespare as it applies to a corner case but
> crucially, it cannot be tuned around like zone_reclaim_mode can. The
> default behaviour should always be to cause the least harm for the
> common case.
> 
> If there are specialised use cases out there that want zone_reclaim_mode
> in specific cases, then it can be built on top. Longterm we should
> consider a memory policy which allows for the node reclaim like behavior
> for the specific memory ranges which would allow a
> 
> [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com
> 
> [mhocko@suse.com: rewrote the changelog based on the one from Andrea]
> Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
> Cc: Zi Yan <zi.yan@cs.rutgers.edu>
> Cc: stable # 4.1+
> Reported-by: Stefan Priebe <s.priebe@profihost.ag>
> Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
> Reported-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Mel Gorman <mgorman@techsingularity.net>

Both patches look correct to me but I'm responding to this one because
it's the fix. The change makes sense and moves further away from the
severe stalling behaviour we used to see with both THP and zone reclaim
mode.

I put together a basic experiment with usemem configured to reference a
buffer multiple times that is 80% the size of main memory on a 2-socket box
with symmetric node sizes and defrag set to "always".  The defrag setting
is not the default but it would be functionally similar to accessing a
buffer with madvise(MADV_HUGEPAGE). Usemem is configured to reference
the buffer multiple times and while it's not an interesting workload,
it would be expected to complete reasonably quickly as it fits within
memory. The results were;

usemem
                                  vanilla           noreclaim-v1
Amean     Elapsd-1       42.78 (   0.00%)       26.87 (  37.18%)
Amean     Elapsd-3       27.55 (   0.00%)        7.44 (  73.00%)
Amean     Elapsd-4        5.72 (   0.00%)        5.69 (   0.45%)

This shows the elapsed time in seconds for 1 thread, 3 threads and 4 threads
referencing buffers 80% the size of memory. With the patches applied, it's
37.18% faster for the single thread and 73% faster with two threads. Note
that 4 threads showing little difference does not indicate the problem is
related to thread counts. It's simply the case that 4 threads gets spread
so their workload mostly fits in one node.

The overall view from /proc/vmstats is more startling

                         4.19.0-rc1  4.19.0-rc1
                            vanillanoreclaim-v1r1
Minor Faults               35593425      708164
Major Faults                 484088          36
Swap Ins                    3772837           0
Swap Outs                   3932295           0

Massive amounts of swap in/out without the patch

Direct pages scanned        6013214           0
Kswapd pages scanned              0           0
Kswapd pages reclaimed            0           0
Direct pages reclaimed      4033009           0

Lots of reclaim activity without the patch

Kswapd efficiency              100%        100%
Kswapd velocity               0.000       0.000
Direct efficiency               67%        100%
Direct velocity           11191.956       0.000

Mostly from direct reclaim context as you'd expect without the patch.

Page writes by reclaim  3932314.000       0.000
Page writes file                 19           0
Page writes anon            3932295           0
Page reclaim immediate        42336           0

Writes from reclaim context is never good but the patch eliminates it.

We should never have default behaviour to thrash the system for such a
basic workload. If zone reclaim mode behaviour is ever desired but on a
single task instead of a global basis then the sensible option is to build
a mempolicy that enforces that behaviour.
Michal Hocko Sept. 25, 2018, 12:30 p.m. UTC | #2
On Tue 25-09-18 13:20:08, Mel Gorman wrote:
> On Tue, Sep 25, 2018 at 02:03:25PM +0200, Michal Hocko wrote:
> > From: Andrea Arcangeli <aarcange@redhat.com>
> > 
> > THP allocation might be really disruptive when allocated on NUMA system
> > with the local node full or hard to reclaim. Stefan has posted an
> > allocation stall report on 4.12 based SLES kernel which suggests the
> > same issue:
> > 
> > [245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
> > [245513.363983] kvm cpuset=/ mems_allowed=0-1
> > [245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased)
> > [245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
> > [245513.365905] Call Trace:
> > [245513.366535]  dump_stack+0x5c/0x84
> > [245513.367148]  warn_alloc+0xe0/0x180
> > [245513.367769]  __alloc_pages_slowpath+0x820/0xc90
> > [245513.368406]  ? __slab_free+0xa9/0x2f0
> > [245513.369048]  ? __slab_free+0xa9/0x2f0
> > [245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
> > [245513.370300]  alloc_pages_vma+0x1e5/0x280
> > [245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
> > [245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
> > [245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
> > [245513.372812]  __handle_mm_fault+0x93d/0x1060
> > [245513.373439]  handle_mm_fault+0xc6/0x1b0
> > [245513.374042]  __do_page_fault+0x230/0x430
> > [245513.374679]  ? get_vtime_delta+0x13/0xb0
> > [245513.375411]  do_page_fault+0x2a/0x70
> > [245513.376145]  ? page_fault+0x65/0x80
> > [245513.376882]  page_fault+0x7b/0x80
> > [...]
> > [245513.382056] Mem-Info:
> > [245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
> >                  active_file:60183 inactive_file:245285 isolated_file:0
> >                  unevictable:15657 dirty:286 writeback:1 unstable:0
> >                  slab_reclaimable:75543 slab_unreclaimable:2509111
> >                  mapped:81814 shmem:31764 pagetables:370616 bounce:0
> >                  free:32294031 free_pcp:6233 free_cma:0
> > [245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> > [245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> > 
> > The defrag mode is "madvise" and from the above report it is clear that
> > the THP has been allocated for MADV_HUGEPAGA vma.
> > 
> > Andrea has identified that the main source of the problem is
> > __GFP_THISNODE usage:
> > 
> > : The problem is that direct compaction combined with the NUMA
> > : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
> > : hard the local node, instead of failing the allocation if there's no
> > : THP available in the local node.
> > :
> > : Such logic was ok until __GFP_THISNODE was added to the THP allocation
> > : path even with MPOL_DEFAULT.
> > :
> > : The idea behind the __GFP_THISNODE addition, is that it is better to
> > : provide local memory in PAGE_SIZE units than to use remote NUMA THP
> > : backed memory. That largely depends on the remote latency though, on
> > : threadrippers for example the overhead is relatively low in my
> > : experience.
> > :
> > : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
> > : extremely slow qemu startup with vfio, if the VM is larger than the
> > : size of one host NUMA node. This is because it will try very hard to
> > : unsuccessfully swapout get_user_pages pinned pages as result of the
> > : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
> > : allocations and instead of trying to allocate THP on other nodes (it
> > : would be even worse without vfio type1 GUP pins of course, except it'd
> > : be swapping heavily instead).
> > 
> > Fix this by removing __GFP_THISNODE for THP requests which are
> > requesting the direct reclaim. This effectivelly reverts 5265047ac301 on
> > the grounds that the zone/node reclaim was known to be disruptive due
> > to premature reclaim when there was memory free. While it made sense at
> > the time for HPC workloads without NUMA awareness on rare machines, it
> > was ultimately harmful in the majority of cases. The existing behaviour
> > is similiar, if not as widespare as it applies to a corner case but
> > crucially, it cannot be tuned around like zone_reclaim_mode can. The
> > default behaviour should always be to cause the least harm for the
> > common case.
> > 
> > If there are specialised use cases out there that want zone_reclaim_mode
> > in specific cases, then it can be built on top. Longterm we should
> > consider a memory policy which allows for the node reclaim like behavior
> > for the specific memory ranges which would allow a
> > 
> > [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com
> > 
> > [mhocko@suse.com: rewrote the changelog based on the one from Andrea]
> > Fixes: 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node")
> > Cc: Zi Yan <zi.yan@cs.rutgers.edu>
> > Cc: stable # 4.1+
> > Reported-by: Stefan Priebe <s.priebe@profihost.ag>
> > Debugged-by: Andrea Arcangeli <aarcange@redhat.com>
> > Reported-by: Alex Williamson <alex.williamson@redhat.com>
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
> 
> Both patches look correct to me but I'm responding to this one because
> it's the fix. The change makes sense and moves further away from the
> severe stalling behaviour we used to see with both THP and zone reclaim
> mode.
> 
> I put together a basic experiment with usemem configured to reference a
> buffer multiple times that is 80% the size of main memory on a 2-socket box
> with symmetric node sizes and defrag set to "always".  The defrag setting
> is not the default but it would be functionally similar to accessing a
> buffer with madvise(MADV_HUGEPAGE). Usemem is configured to reference
> the buffer multiple times and while it's not an interesting workload,
> it would be expected to complete reasonably quickly as it fits within
> memory. The results were;
> 
> usemem
>                                   vanilla           noreclaim-v1
> Amean     Elapsd-1       42.78 (   0.00%)       26.87 (  37.18%)
> Amean     Elapsd-3       27.55 (   0.00%)        7.44 (  73.00%)
> Amean     Elapsd-4        5.72 (   0.00%)        5.69 (   0.45%)
> 
> This shows the elapsed time in seconds for 1 thread, 3 threads and 4 threads
> referencing buffers 80% the size of memory. With the patches applied, it's
> 37.18% faster for the single thread and 73% faster with two threads. Note
> that 4 threads showing little difference does not indicate the problem is
> related to thread counts. It's simply the case that 4 threads gets spread
> so their workload mostly fits in one node.
> 
> The overall view from /proc/vmstats is more startling
> 
>                          4.19.0-rc1  4.19.0-rc1
>                             vanillanoreclaim-v1r1
> Minor Faults               35593425      708164
> Major Faults                 484088          36
> Swap Ins                    3772837           0
> Swap Outs                   3932295           0
> 
> Massive amounts of swap in/out without the patch
> 
> Direct pages scanned        6013214           0
> Kswapd pages scanned              0           0
> Kswapd pages reclaimed            0           0
> Direct pages reclaimed      4033009           0
> 
> Lots of reclaim activity without the patch
> 
> Kswapd efficiency              100%        100%
> Kswapd velocity               0.000       0.000
> Direct efficiency               67%        100%
> Direct velocity           11191.956       0.000
> 
> Mostly from direct reclaim context as you'd expect without the patch.
> 
> Page writes by reclaim  3932314.000       0.000
> Page writes file                 19           0
> Page writes anon            3932295           0
> Page reclaim immediate        42336           0
> 
> Writes from reclaim context is never good but the patch eliminates it.
> 
> We should never have default behaviour to thrash the system for such a
> basic workload. If zone reclaim mode behaviour is ever desired but on a
> single task instead of a global basis then the sensible option is to build
> a mempolicy that enforces that behaviour.

Thanks a lot for numbers Mel!
David Rientjes Oct. 4, 2018, 8:16 p.m. UTC | #3
On Tue, 25 Sep 2018, Michal Hocko wrote:

> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index da858f794eb6..149b6f4cf023 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2046,8 +2046,36 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
>  		nmask = policy_nodemask(gfp, pol);
>  		if (!nmask || node_isset(hpage_node, *nmask)) {
>  			mpol_cond_put(pol);
> -			page = __alloc_pages_node(hpage_node,
> -						gfp | __GFP_THISNODE, order);
> +			/*
> +			 * We cannot invoke reclaim if __GFP_THISNODE
> +			 * is set. Invoking reclaim with
> +			 * __GFP_THISNODE set, would cause THP
> +			 * allocations to trigger heavy swapping
> +			 * despite there may be tons of free memory
> +			 * (including potentially plenty of THP
> +			 * already available in the buddy) on all the
> +			 * other NUMA nodes.
> +			 *
> +			 * At most we could invoke compaction when
> +			 * __GFP_THISNODE is set (but we would need to
> +			 * refrain from invoking reclaim even if
> +			 * compaction returned COMPACT_SKIPPED because
> +			 * there wasn't not enough memory to succeed
> +			 * compaction). For now just avoid
> +			 * __GFP_THISNODE instead of limiting the
> +			 * allocation path to a strict and single
> +			 * compaction invocation.
> +			 *
> +			 * Supposedly if direct reclaim was enabled by
> +			 * the caller, the app prefers THP regardless
> +			 * of the node it comes from so this would be
> +			 * more desiderable behavior than only
> +			 * providing THP originated from the local
> +			 * node in such case.
> +			 */
> +			if (!(gfp & __GFP_DIRECT_RECLAIM))
> +				gfp |= __GFP_THISNODE;
> +			page = __alloc_pages_node(hpage_node, gfp, order);
>  			goto out;
>  		}
>  	}

This causes, on average, a 13.9% access latency regression on Haswell, and 
the regression would likely be more severe on Naples and Rome.

There exist libraries that allow the .text segment of processes to be 
remapped to memory backed by transparent hugepages and use MADV_HUGEPAGE 
to stress local compaction to defragment node local memory for hugepages 
at startup.  The cost, including the statistics Mel gathered, is 
acceptable for these processes: they are not concerned with startup cost, 
they are concerned only with optimal access latency while they are 
running.

So while it may take longer to start the process because memory compaction 
is attempting to allocate hugepages with __GFP_DIRECT_RECLAIM, in the 
cases where compaction is successful, this is a very significant long-term 
win.  In cases where compaction fails, falling back to local pages of the 
native page size instead of remote thp is a win for the remaining time 
this process wins: as stated, 13.9% faster for all memory accesses to the 
process's text while it runs on Haswell.

So these processes, and there are 100's of users of these libraries, 
explicitly want to incur the cost of startup to attempt to get local thp 
and fallback only when necessary to local pages of the native page size.  
They want the behavior this patch is changing and regress if the patch is 
merged.

This patch does not provide an alternative beyond MPOL_BIND to preserve 
the existing behavior.  In that case, if local memory is actually oom, we 
unnecessarily oom kill processes which would be completely unacceptable to 
these users since they are fine to accept 10-20% of their memory being 
faulted remotely if necessary to prevent processes being oom killed.

These libraries were implemented with the existing behavior of 
MADV_HUGEPAGE in mind and I cannot ask for 100s of binaries to be rebuilt 
to enforce new constraints specifically for hugepages with no alternative 
that does not allow unnecessary oom killing.

There are ways to address this without introducing regressions for 
existing users of MADV_HUGEPAGE: introduce an madvise() mode to accept 
remote thp allocations, which users of this library would never set, or 
fix memory compaction so that it does not incur substantial allocation 
latency when it will likely fail.

We don't introduce 13.9% regressions for binaries that are correctly using 
MADV_HUGEPAGE as it is implemented.

Nacked-by: David Rientjes <rientjes@google.com>
Andrea Arcangeli Oct. 4, 2018, 9:10 p.m. UTC | #4
Hello David,

On Thu, Oct 04, 2018 at 01:16:32PM -0700, David Rientjes wrote:
> There are ways to address this without introducing regressions for 
> existing users of MADV_HUGEPAGE: introduce an madvise() mode to accept 
> remote thp allocations, which users of this library would never set, or 
> fix memory compaction so that it does not incur substantial allocation 
> latency when it will likely fail.

These librarians needs to call a new MADV_ and the current
MADV_HUGEPAGE should not be affected because the new MADV_ will
require some capbility (i.e. root privilege).

qemu was the first user of MADV_HUGEPAGE and I don't think it's fair
to break it and require change to it to run at higher privilege to
retain the direct compaction behavior of MADV_HUGEPAGE.

The new behavior you ask to retain in MADV_HUGEPAGE, generated the
same misbehavior to VM as mlock could have done too, so it can't just
be given by default without any privilege whatsoever.

Ok you could mitigate the breakage that MADV_HUGEPAGE could have
generated (before the recent fix) by isolating malicious or
inefficient programs with memcg, but by default in a multiuser system
without cgroups the global disruption provided before the fix
(i.e. the pathological THP behavior) is not warranted. memcg shouldn't
be mandatory to avoid a process to affect the VM in such a strong way
(i.e. all other processes who happened to be allocated in the node
where the THP allocation triggered, being trashed in swap like if all
memory of all other nodes was not completely free).

Not only that, it's not only about malicious processes it's also
excessively inefficient for processes that just don't fit in a local
node and use MADV_HUGEPAGE. Your processes all fit in the local node
for sure if they're happy about it. This was reported as a
"pathological THP regression" after all in a workload that couldn't
swap at all because of the iommu gup persistent refcount pins.

The alternative patch I posted which still invoked direct reclaim
locally, and falled back to NUMA-local PAGE_SIZEd allocations instead
of falling back to NUMA-remote THP, could have been given without
privilege, but it still won't swapout as this patch would have done,
so it still won't go as far as the MADV_HUGEPAGE behavior had before
the fix (for the lib users that strongly needed local memory).

Overall I think the call about the default behavior of MADV_HUGEPAGE
is still between removing __GFP_THISNODE if gfp_flags can reclaim (the
fix in -mm), or by changing direct compaction to only call compaction
and not reclaim (i.e. __GFP_COMPACT_ONLY) when __GFP_THISNODE is set.

To go beyond that some privilege is needed and a new MADV_ flag can
require privilege or return error if there's not enough privilege. So
the lib with 100's users can try to use that new flag first, show an
error in stderr (maybe under debug), and fallback to MADV_HUGEPAGE if
the app hasn't enough privilege. The alternative is to add a new mem
policy less strict than MPOL_BIND to achieve what you need on top of
MADV_HUGEPAGE (which also would require some privilege of course as
all mbinds). I assume you already evaluated the preferred and local
mbinds and it's not a perfect fit?

If we keep this as a new MADV_HUGEPAGE_FORCE_LOCAL flag, you could
still add a THP sysfs/sysctl control to lift the privilege requirement
marking it as insecure setting in docs
(mm/transparent_hugepage/madv_hugepage_force_local=0|1 forced to 0 by
default). This would be on the same lines of other sysctl that
increase the max number of files open and such things (perhaps a
sysctl would be better in fact for tuning in /etc/sysctl.conf).

Note there was still some improvement left possible in my
__GFP_COMPACT_ONLY patch alternative. Notably if the watermarks for
the local node shown the local node not to have enough real "free"
PAGE_SIZEd pages to succeed the PAGE_SIZEd local THP allocation if
compaction failed, we should have relaxed __GFP_THISNODE and tried to
allocate THP from the NUMA-remote nodes before falling back to
PAGE_SIZEd allocations. That also won't require any new privilege.

To get the same behavior as before though you would need to drop all
caches with echo 3 >drop_caches before the app starts (and it still
won't swap anon memory which previous MADV_HUGEPAGE behavior would
have, but the whole point is that the previous MADV_HUGEPAGE behavior
would have backfired for the lib too if the app was allocating so much
RAM from the local node that it required swapouts of local anon
memory).

Thanks,
Andrea
David Rientjes Oct. 4, 2018, 11:05 p.m. UTC | #5
On Thu, 4 Oct 2018, Andrea Arcangeli wrote:

> Hello David,
> 

Hi Andrea,

> On Thu, Oct 04, 2018 at 01:16:32PM -0700, David Rientjes wrote:
> > There are ways to address this without introducing regressions for 
> > existing users of MADV_HUGEPAGE: introduce an madvise() mode to accept 
> > remote thp allocations, which users of this library would never set, or 
> > fix memory compaction so that it does not incur substantial allocation 
> > latency when it will likely fail.
> 
> These librarians needs to call a new MADV_ and the current
> MADV_HUGEPAGE should not be affected because the new MADV_ will
> require some capbility (i.e. root privilege).
> 
> qemu was the first user of MADV_HUGEPAGE and I don't think it's fair
> to break it and require change to it to run at higher privilege to
> retain the direct compaction behavior of MADV_HUGEPAGE.
> 
> The new behavior you ask to retain in MADV_HUGEPAGE, generated the
> same misbehavior to VM as mlock could have done too, so it can't just
> be given by default without any privilege whatsoever.
> 
> Ok you could mitigate the breakage that MADV_HUGEPAGE could have
> generated (before the recent fix) by isolating malicious or
> inefficient programs with memcg, but by default in a multiuser system
> without cgroups the global disruption provided before the fix
> (i.e. the pathological THP behavior) is not warranted. memcg shouldn't
> be mandatory to avoid a process to affect the VM in such a strong way
> (i.e. all other processes who happened to be allocated in the node
> where the THP allocation triggered, being trashed in swap like if all
> memory of all other nodes was not completely free).
> 

The source of the problem needs to be addressed: memory compaction.  We 
regress because we lose __GFP_NORETRY and pointlessly try reclaim, but 
deferred compaction is supposedly going to prevent repeated (and 
unnecessary) calls to memory compaction that ends up thrashing your local 
node.

This is likely because your workload has a size greater than 2MB * the 
deferred compaction threshold, normally set at 64.  This ends up 
repeatedly calling memory compaction and ending up being expensive when it 
should fail once and not be called again in the near term.

But that's a memory compaction issue, not a thp gfp mask issue; the 
reclaim issue is responded to below.

> Not only that, it's not only about malicious processes it's also
> excessively inefficient for processes that just don't fit in a local
> node and use MADV_HUGEPAGE. Your processes all fit in the local node
> for sure if they're happy about it. This was reported as a
> "pathological THP regression" after all in a workload that couldn't
> swap at all because of the iommu gup persistent refcount pins.
> 

This patch causes an even worse regression if all system memory is 
fragmented such that thp cannot be allocated because it tries to stress 
compaction on remote nodes, which ends up unsuccessfully, not just the 
local node.

On Haswell, when all memory is fragmented (not just the local node as I 
obtained by 13.9% regression result), the patch results in a fault latency 
regression of 40.9% for MADV_HUGEPAGE region of 8GB.  This is because it 
is thrashing both nodes pointlessly instead of just failing for 
__GFP_THISNODE.

So the end result is that the patch regresses access latency forever by 
13.9% when the local node is fragmented because it is accessing remote thp 
vs local pages of the native page size, and regresses fault latency of 
40.9% when the system is fully fragmented.  The only time that fault 
latency is improved is when remote memory is not fully fragmented, but 
then you must incur the remote access latency.

> Overall I think the call about the default behavior of MADV_HUGEPAGE
> is still between removing __GFP_THISNODE if gfp_flags can reclaim (the
> fix in -mm), or by changing direct compaction to only call compaction
> and not reclaim (i.e. __GFP_COMPACT_ONLY) when __GFP_THISNODE is set.
> 

There's two issues: the expensiveness of the page allocator involving 
compaction for MADV_HUGEPAGE mappings and the desire for userspace to 
fault thp remotely and incur the 13.9% performance regression forever.

If reclaim is avoided like it should be with __GFP_NORETRY for even 
MADV_HUGEPAGE regions, you should only experience latency introduced by 
node local memory compaction.  The __GFP_NORETRY was removed by commit 
2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised 
allocations").  The current implementation of the page allocator does not 
match the expected behavior of the thp gfp flags.

Memory compaction has deferred compaction to avoid costly scanning when it 
has recently failed, and that likely needs to be addressed directly rather 
than relying on a count of how many times it has failed; if you fault more 
than 128MB at the same time, does it make sense to immediately compact 
again?  Likely not.

> To go beyond that some privilege is needed and a new MADV_ flag can
> require privilege or return error if there's not enough privilege. So
> the lib with 100's users can try to use that new flag first, show an
> error in stderr (maybe under debug), and fallback to MADV_HUGEPAGE if
> the app hasn't enough privilege. The alternative is to add a new mem
> policy less strict than MPOL_BIND to achieve what you need on top of
> MADV_HUGEPAGE (which also would require some privilege of course as
> all mbinds). I assume you already evaluated the preferred and local
> mbinds and it's not a perfect fit?
> 
> If we keep this as a new MADV_HUGEPAGE_FORCE_LOCAL flag, you could
> still add a THP sysfs/sysctl control to lift the privilege requirement
> marking it as insecure setting in docs
> (mm/transparent_hugepage/madv_hugepage_force_local=0|1 forced to 0 by
> default). This would be on the same lines of other sysctl that
> increase the max number of files open and such things (perhaps a
> sysctl would be better in fact for tuning in /etc/sysctl.conf).
> 
> Note there was still some improvement left possible in my
> __GFP_COMPACT_ONLY patch alternative. Notably if the watermarks for
> the local node shown the local node not to have enough real "free"
> PAGE_SIZEd pages to succeed the PAGE_SIZEd local THP allocation if
> compaction failed, we should have relaxed __GFP_THISNODE and tried to
> allocate THP from the NUMA-remote nodes before falling back to
> PAGE_SIZEd allocations. That also won't require any new privilege.
> 

Direct reclaim doesn't make much sense for thp allocations if compaction 
has failed, even for MADV_HUGEPAGE.  I've discounted Mel's results because 
he is using thp defrag set to "always", which includes __GFP_NORETRY but 
the default option and anything else other than "always" does not use 
__GFP_NORETRY like the page allocator believes it does:

                /*
                 * Checks for costly allocations with __GFP_NORETRY, which
                 * includes THP page fault allocations
                 */
                if (costly_order && (gfp_mask & __GFP_NORETRY)) {
                        /*
                         * If compaction is deferred for high-order allocations,
                         * it is because sync compaction recently failed. If
                         * this is the case and the caller requested a THP
                         * allocation, we do not want to heavily disrupt the
                         * system, so we fail the allocation instead of entering
                         * direct reclaim.
                         */
                        if (compact_result == COMPACT_DEFERRED)
                                goto nopage;

So he is avoiding the cost of reclaim, which you are not, specifically 
because he is using defrag == "always".  __GFP_NORETRY should be included 
for any thp allocation and it's a regression that it doesn't.
Mel Gorman Oct. 5, 2018, 7:38 a.m. UTC | #6
On Thu, Oct 04, 2018 at 01:16:32PM -0700, David Rientjes wrote:
> On Tue, 25 Sep 2018, Michal Hocko wrote:
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index da858f794eb6..149b6f4cf023 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -2046,8 +2046,36 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
> >  		nmask = policy_nodemask(gfp, pol);
> >  		if (!nmask || node_isset(hpage_node, *nmask)) {
> >  			mpol_cond_put(pol);
> > -			page = __alloc_pages_node(hpage_node,
> > -						gfp | __GFP_THISNODE, order);
> > +			/*
> > +			 * We cannot invoke reclaim if __GFP_THISNODE
> > +			 * is set. Invoking reclaim with
> > +			 * __GFP_THISNODE set, would cause THP
> > +			 * allocations to trigger heavy swapping
> > +			 * despite there may be tons of free memory
> > +			 * (including potentially plenty of THP
> > +			 * already available in the buddy) on all the
> > +			 * other NUMA nodes.
> > +			 *
> > +			 * At most we could invoke compaction when
> > +			 * __GFP_THISNODE is set (but we would need to
> > +			 * refrain from invoking reclaim even if
> > +			 * compaction returned COMPACT_SKIPPED because
> > +			 * there wasn't not enough memory to succeed
> > +			 * compaction). For now just avoid
> > +			 * __GFP_THISNODE instead of limiting the
> > +			 * allocation path to a strict and single
> > +			 * compaction invocation.
> > +			 *
> > +			 * Supposedly if direct reclaim was enabled by
> > +			 * the caller, the app prefers THP regardless
> > +			 * of the node it comes from so this would be
> > +			 * more desiderable behavior than only
> > +			 * providing THP originated from the local
> > +			 * node in such case.
> > +			 */
> > +			if (!(gfp & __GFP_DIRECT_RECLAIM))
> > +				gfp |= __GFP_THISNODE;
> > +			page = __alloc_pages_node(hpage_node, gfp, order);
> >  			goto out;
> >  		}
> >  	}
> 
> This causes, on average, a 13.9% access latency regression on Haswell, and 
> the regression would likely be more severe on Naples and Rome.
> 

That assumes that fragmentation prevents easy allocation which may very
well be the case. While it would be great that compaction or the page
allocator could be further improved to deal with fragmentation, it's
outside the scope of this patch.

> There exist libraries that allow the .text segment of processes to be 
> remapped to memory backed by transparent hugepages and use MADV_HUGEPAGE 
> to stress local compaction to defragment node local memory for hugepages 
> at startup. 

That is taking advantage of a co-incidence of the implementation.
MADV_HUGEPAGE is *advice* that huge pages be used, not what the locality
is. A hint for strong locality preferences should be separate advice
(madvise) or a separate memory policy. Doing that is outside the context
of this patch but nothing stops you introducing such a policy or madvise,
whichever you think would be best for the libraries to consume (I'm only
aware of libhugetlbfs but there might be others).

> The cost, including the statistics Mel gathered, is 
> acceptable for these processes: they are not concerned with startup cost, 
> they are concerned only with optimal access latency while they are 
> running.
> 

Then such applications at startup have the option of setting
zone_reclaim_mode during initialisation assuming a privileged helper
can be created. That would be somewhat heavy handed and a longer-term
solution would still be to create a proper memory policy of madvise flag
for those libraries.

> So while it may take longer to start the process because memory compaction 
> is attempting to allocate hugepages with __GFP_DIRECT_RECLAIM, in the 
> cases where compaction is successful, this is a very significant long-term 
> win.  In cases where compaction fails, falling back to local pages of the 
> native page size instead of remote thp is a win for the remaining time 
> this process wins: as stated, 13.9% faster for all memory accesses to the 
> process's text while it runs on Haswell.
> 

Again, I remind you that it only benefits applications that prefectly
fit into NUMA nodes. Not all applications are created with that level of
awareness and easily get thrashed if using MADV_HUGEPAGE and do not fit
into a NUMA node.

While it is unfortunate that there are specialised applications that
benefit from the current configuration, I bet there is heavier usage of
qemu affected by the bug this patch addresses than specialised
applications that both fit perfectly into NUMA nodes and are extremely
sensitive to access latencies. It's a question of causing the least harm
to the most users which is what this patch does.

If you need behaviour for more agressive reclaim or locality hints then
kindly introduce them and do not depend in MADV_HUGEPAGE accidentically
doubling up as hints about memory locality.
David Rientjes Oct. 5, 2018, 8:35 p.m. UTC | #7
On Fri, 5 Oct 2018, Mel Gorman wrote:

> > This causes, on average, a 13.9% access latency regression on Haswell, and 
> > the regression would likely be more severe on Naples and Rome.
> > 
> 
> That assumes that fragmentation prevents easy allocation which may very
> well be the case. While it would be great that compaction or the page
> allocator could be further improved to deal with fragmentation, it's
> outside the scope of this patch.
> 

Hi Mel,

The regression that Andrea is working on, correct me if I'm wrong, is 
heavy reclaim and swapping activity that is trying to desperately allocate 
local hugepages when the local node is fragmented based on advice provided 
by MADV_HUGEPAGE.

Why is it ever appropriate to do heavy reclaim and swap activity to 
allocate a transparent hugepage?  This is exactly what the __GFP_NORETRY 
check for high-order allocations is attempting to avoid, and it explicitly 
states that it is for thp faults.  The fact that we lost __GFP_NORERY for 
thp allocations for all settings, including the default setting, other 
than yours (setting of "always") is what I'm focusing on.  There is no 
guarantee that this activity will free an entire pageblock or that it is 
even worthwhile.

Why is thp memory ever being allocated without __GFP_NORETRY as the page 
allocator expects?

That aside, removing __GFP_THISNODE can make the fault latency much worse 
if remote notes are fragmented and/or reclaim has the inability to free 
contiguous memory, which it likely cannot.  This is where I measured over 
40% fault latency regression from Linus's tree with this patch on a 
fragmnented system where order-9 memory is neither available from node 0 
or node 1 on Haswell.

> > There exist libraries that allow the .text segment of processes to be 
> > remapped to memory backed by transparent hugepages and use MADV_HUGEPAGE 
> > to stress local compaction to defragment node local memory for hugepages 
> > at startup. 
> 
> That is taking advantage of a co-incidence of the implementation.
> MADV_HUGEPAGE is *advice* that huge pages be used, not what the locality
> is. A hint for strong locality preferences should be separate advice
> (madvise) or a separate memory policy. Doing that is outside the context
> of this patch but nothing stops you introducing such a policy or madvise,
> whichever you think would be best for the libraries to consume (I'm only
> aware of libhugetlbfs but there might be others).
> 

The behavior that MADV_HUGEPAGE specifies is certainly not clearly 
defined, unfortunately.  The way that an application writer may read it, 
as we have, is that it will make a stronger attempt at allocating a 
hugepage at fault.  This actually works quite well when the allocation 
correctly has __GFP_NORETRY, as it's supposed to, and compaction is 
MIGRATE_ASYNC.

So rather than focusing on what MADV_HUGEPAGE has meant over the past 2+ 
years of kernels that we have implemented based on, or what it meant prior 
to that, is a fundamental question of the purpose of direct reclaim and 
swap activity that had always been precluded before __GFP_NORETRY was 
removed in a thp allocation.  I don't think anybody in this thread wants 
14% remote access latency regression if we allocate remotely or 40% fault 
latency regression when remote nodes are fragmented as well.

Removing __GFP_THISNODE only helps when remote memory is not fragmented, 
otherwise it multiplies the problem as I've shown.

The numbers that you provide while using the non-default option to mimick 
MADV_HUGEPAGE mappings but also use __GFP_NORETRY makes the actual source 
of the problem quite easy to identify: there is an inconsistency in the 
thp gfp mask and the page allocator implementation.

> > The cost, including the statistics Mel gathered, is 
> > acceptable for these processes: they are not concerned with startup cost, 
> > they are concerned only with optimal access latency while they are 
> > running.
> > 
> 
> Then such applications at startup have the option of setting
> zone_reclaim_mode during initialisation assuming a privileged helper
> can be created. That would be somewhat heavy handed and a longer-term
> solution would still be to create a proper memory policy of madvise flag
> for those libraries.
> 

We *never* want to use zone_reclaim_mode for these allocations, that would 
be even worse, we do not want to reclaim because we have a very unlikely 
chance of making pageblocks free without the involvement of compaction.  
We want to trigger memory compaction with a well-bounded cost that 
MIGRATE_ASYNC provides and then fail.
Andrea Arcangeli Oct. 5, 2018, 11:21 p.m. UTC | #8
Hi,

On Fri, Oct 05, 2018 at 01:35:15PM -0700, David Rientjes wrote:
> Why is it ever appropriate to do heavy reclaim and swap activity to 
> allocate a transparent hugepage?  This is exactly what the __GFP_NORETRY 
> check for high-order allocations is attempting to avoid, and it explicitly 
> states that it is for thp faults.  The fact that we lost __GFP_NORERY for 
> thp allocations for all settings, including the default setting, other 
> than yours (setting of "always") is what I'm focusing on.  There is no 
> guarantee that this activity will free an entire pageblock or that it is 
> even worthwhile.

I tried to add just __GFP_NORETRY but it changes nothing. Try it
yourself if you think that can resolve the swap storm and excessive
reclaim CPU overhead... and see if it works. I didn't intend to
reinvent the wheel with __GFP_COMPACT_ONLY, if __GFP_NORETRY would
have worked. I tried adding __GFP_NORETRY first of course.

Reason why it doesn't help is: compaction fails because not enough
free RAM, reclaim is invoked, compaction succeeds, THP is allocated to
your lib user, compaction fails because not enough free RAM, reclaim
is invoked etc.. compact_result is not COMPACT_DEFERRED, but
COMPACT_SKIPPED.

See the part "reclaim is invoked" (with __GFP_THISNODE), is enough to
still create the same heavy swap storm and unfairly penalize all apps
with memory allocated in the local node like if your library had
actually the kernel privilege to run mbind or mlock, which is not ok.

Only __GFP_COMPACT_ONLY truly can avoid reclaim, the moment reclaim
can run with __GFP_THISNODE set, all bets are off and we're back to
square one, no difference (at best marginal difference) with
__GFP_NORETRY being set.

> That aside, removing __GFP_THISNODE can make the fault latency much worse 
> if remote notes are fragmented and/or reclaim has the inability to free 
> contiguous memory, which it likely cannot.  This is where I measured over 
> 40% fault latency regression from Linus's tree with this patch on a 
> fragmnented system where order-9 memory is neither available from node 0 
> or node 1 on Haswell.

Discussing the drawbacks of removing __GFP_THISNODE is an orthogonal
topic. __GFP_COMPACT_ONLY approach didn't have any of those drawbacks
about the remote latency because __GFP_THISNODE was still set at all
times, just as you like it. You seem to think __GFP_NORETRY will work
as well as __GFP_COMPACT_ONLY but it doesn't.

Calling compaction (and only compaction!) with __GFP_THISNODE set
doesn't break anything and that was what __GFP_COMPACT_ONLY was about.

> The behavior that MADV_HUGEPAGE specifies is certainly not clearly 
> defined, unfortunately.  The way that an application writer may read it, 
> as we have, is that it will make a stronger attempt at allocating a 
> hugepage at fault.  This actually works quite well when the allocation 
> correctly has __GFP_NORETRY, as it's supposed to, and compaction is 
> MIGRATE_ASYNC.

Like Mel said, your app just happens to fit in a local node, if the
user of the lib is slightly different and allocates 16G on a system
where each node is 4G, the post-fix MADV_HUGEPAGE will perform
extremely better also for the lib user.

And you know, if the lib user fits in one node, it can use mbind and
it won't hit OOM... and you'd need some capability giving the app
privilege anyway to keep MADV_HUGEPAGE as deep and unfair to the rest
of the processes running the local node (like mbind and mlock require
too).

Could you just run a test with the special lib and allocate 4 times
the size of a node, and see how the lib performs with upstream and
upstream+fix? Feel free to add __GFP_NORETRY anywhere you like in the
test of the upstream without fix.

The only constraint I would ask for the test (if the app using the lib
is not a massively multithreaded app, like qemu is, and you just
intend to run malloc(SIZEOFNODE*4); memset) is to run the app-lib
under "taskset -c 0". Otherwise NUMA balancing could move the the CPU
next to the last memory touched, which couldn't be done if each thread
accesses all ram at random from all 4 nodes at the same time (which is
a totally legitimate workload too and must not hit the "pathological
THP allocation performance").

> removed in a thp allocation.  I don't think anybody in this thread wants 
> 14% remote access latency regression if we allocate remotely or 40% fault 
> latency regression when remote nodes are fragmented as well.

Did you try the __GFP_COMPACT_ONLY patch? That won't have the 40%
fault latency already.

Also you're underestimating the benefit of THP given from remote nodes
for virt a bit, the 40% fault latency is not an issue when the
allocation is long lived, which is what MADV_HUGEPAGE is telling the
kernel, and the benefit of THP for guest is multiplied. It's more a
feature than a bug that 40% fault latency with MADV_HUGEPAGE set at
least for all long lived allocations (but if the allocations aren't
long lived, why should MADV_HUGEPAGE have been set in the first place?).

With the fix applied, if you want to add __GFP_THISNODE to compaction
only by default you still can, that solves the 40% fault latency just
like __GFP_COMPACT_ONLY patch would also have avoided that 40%
increased fault latency.

The issue here is very simple: you can't use __GFP_THISNODE for an
allocation that can invoke reclaim, unless you have mlock or mbind
higher privilege capabilities.

Compaction can be totally called with __GFP_THISNODE. Just I believe
KVM prefers the fault latency increased as it only happens during VM
warmup phase and it prefers more THP immediately. Compaction on the
remote nodes (i.e. the higher latency) is not wasted CPU for KVM, as
it would be for a short lived allocations (MADV_HUGEPAGE incidentally
is about to tell the kernel which allocations are long lived and to
tell the kernel when it's worth running compaction in direct reclaim).

> The numbers that you provide while using the non-default option to mimick 
> MADV_HUGEPAGE mappings but also use __GFP_NORETRY makes the actual source 
> of the problem quite easy to identify: there is an inconsistency in the 
> thp gfp mask and the page allocator implementation.

I'm still unconvinced about the __GFP_NORETRY arguments because I
tried it already. However if you send a patch that fixes everything by
only adding __GFP_NORETRY in every place you wish, I'd be glad to test
it and verify if it actually solves the problem. Also note:

+#define __GFP_ONLY_COMPACT     ((__force gfp_t)(___GFP_NORETRY | \
+                                                ___GFP_ONLY_COMPACT))

If ___GFP_NORETRY would have been enough __GFP_ONLY_COMPACT would have
defined to ___GFP_NORETRY alone. When I figured it wasn't enough I
added ___GFP_ONLY_COMPACT to retain the __GFP_THISNODE in the only
place it's safe to retain it (i.e. compaction).

Thanks,
Andrea
Andrea Arcangeli Oct. 6, 2018, 3:19 a.m. UTC | #9
Hello,

On Thu, Oct 04, 2018 at 04:05:26PM -0700, David Rientjes wrote:
> The source of the problem needs to be addressed: memory compaction.  We 
> regress because we lose __GFP_NORETRY and pointlessly try reclaim, but 

I commented in detail about the __GFP_NORETRY topic in the other email
so I will skip the discussion about __GFP_NORETRY in the context of
this answer except for the comment at the end of the email to the
actual code that implements __GFP_NORETRY.

> But that's a memory compaction issue, not a thp gfp mask issue; the 
> reclaim issue is responded to below.

Actually memory compaction has no issues whatsoever with
__GFP_THISNODE regardless of __GFP_NORETRY.

> This patch causes an even worse regression if all system memory is 
> fragmented such that thp cannot be allocated because it tries to stress 
> compaction on remote nodes, which ends up unsuccessfully, not just the 
> local node.
> 
> On Haswell, when all memory is fragmented (not just the local node as I 
> obtained by 13.9% regression result), the patch results in a fault latency 
> regression of 40.9% for MADV_HUGEPAGE region of 8GB.  This is because it 
> is thrashing both nodes pointlessly instead of just failing for 
> __GFP_THISNODE.

There's no I/O involved at the very least on compaction, nor we drop
any cache or shrink any slab by mistake by just invoking compaction.
Even when you hit the worst case "all nodes are 100% fragmented"
scenario that generates the 40% increased allocation latency, all
other tasks running in the local node will keep running fine, and they
won't be pushed away forcefully into swap with all their kernel cache
depleted, which is a mlock/mbind privileged behavior that the app
using the MADV_HUGEPAGE lib should not ever been able to inflict on
other processes running in the node from different users (users as in
uid).

Furthermore when you incur the worst case latency after that there's
compact deferred logic skipping compaction next time around if all
nodes were so fragmented to the point of guaranteed failure. While
there's nothing stopping reclaim to run every time COMPACT_SKIPPED is
returned just because compaction keeps succeeding as reclaim keeps
pushing more 2M amounts into swap from the local nodes.

I don't doubt with 1024 nodes things can get pretty bad when they're
all 100% fragmented, __GFP_THISNODE would win in such case, but then
what you're asking then is the __GFP_COMPACT_ONLY behavior. That will
solve it.

What we'd need probably regardless of how we solve this bug (because
not all compaction invocations are THP invocations... and we can't
keep making special cases and optimizations tailored for THP or we end
up in that same 40% higher latency for large skbs and other stuff) is
a more sophisticated COMPACT_DEFERRED logic where you can track when
remote compaction failed. Then you wait many more times before trying
a global compaction. It could be achieved with just a compact_deferred
counter in the zone/pgdat (wherever it fits best).

Overall I don't think the bug we're dealing with and the slowdown of
compaction on the remote nodes are comparable, also considering the
latter will still happen regardless if you've large skbs or other
drivers allocating large amounts of memory as an optimization.

> So the end result is that the patch regresses access latency forever by 
> 13.9% when the local node is fragmented because it is accessing remote thp 
> vs local pages of the native page size, and regresses fault latency of 
> 40.9% when the system is fully fragmented.  The only time that fault 
> latency is improved is when remote memory is not fully fragmented, but 
> then you must incur the remote access latency.

You get THP however which will reduce the TLB miss cost and maximize
TLB usage, so it depends on the app if that 13.9% cost is actually
offseted by the THP benefit or not.

It entirely depends if large part of the workload mostly fits in
in-socket CPU cache. The more the in-socket/node CPU cache pays off,
the more remote-THP also pays off. There would be definitely workloads
that would run faster, not slower, with the remote THP instead of
local PAGE_SIZEd memory. The benefit of THP is also larger for the
guest loads than for host loads, so it depends on that too.

We agree about the latency issue with a ton of RAM and thousands of
nodes, but again that can be mitigated with a NUMA friendly
COMPACT_DEFERRED logic NUMA aware. Even without such
NUMA-aware-compact_deferred logic improvement, the worst case of the
remote compaction behavior still doesn't look nearly as bad as this
bug by thinking about it. And it only is a concern for extremely large
NUMA systems (which may run the risk of running in other solubility
issues in other places if random workloads are applied to it and all
nodes are low on memory and fully fragmented which is far from common
scenario on those large systems), while the bug we fixed was hurting
badly all very common 2 nodes installs with workloads that are common
and should run fine.

> Direct reclaim doesn't make much sense for thp allocations if compaction 
> has failed, even for MADV_HUGEPAGE.  I've discounted Mel's results because 
> he is using thp defrag set to "always", which includes __GFP_NORETRY but 
> the default option and anything else other than "always" does not use 
> __GFP_NORETRY like the page allocator believes it does:
> 
>                 /*
>                  * Checks for costly allocations with __GFP_NORETRY, which
>                  * includes THP page fault allocations
>                  */
>                 if (costly_order && (gfp_mask & __GFP_NORETRY)) {
>                         /*
>                          * If compaction is deferred for high-order allocations,
>                          * it is because sync compaction recently failed. If
>                          * this is the case and the caller requested a THP
>                          * allocation, we do not want to heavily disrupt the
>                          * system, so we fail the allocation instead of entering
>                          * direct reclaim.
>                          */
>                         if (compact_result == COMPACT_DEFERRED)
>                                 goto nopage;
> 
> So he is avoiding the cost of reclaim, which you are not, specifically 
> because he is using defrag == "always".  __GFP_NORETRY should be included 
> for any thp allocation and it's a regression that it doesn't.

Compaction doesn't fail, it returns COMPACT_SKIPPED and it asks to do
a run of reclam to generate those 2M of PAGE_SIZEd memory required to
move a 2M piece into the newly freely PAGE_SIZEd fragments. So
__GFP_NORETRY never jumps to "nopage" and it never gets deferred either.

For the record, I didn't trace it literally with gdb to validate my
theory of why forcefully adding __GFP_NORETRY didn't move the needle,
so feel free to do more investigations in that area if you see any
pitfall in the theory.

Thanks,
Andrea
David Rientjes Oct. 8, 2018, 8:41 p.m. UTC | #10
On Fri, 5 Oct 2018, Andrea Arcangeli wrote:

> I tried to add just __GFP_NORETRY but it changes nothing. Try it
> yourself if you think that can resolve the swap storm and excessive
> reclaim CPU overhead... and see if it works. I didn't intend to
> reinvent the wheel with __GFP_COMPACT_ONLY, if __GFP_NORETRY would
> have worked. I tried adding __GFP_NORETRY first of course.
> 
> Reason why it doesn't help is: compaction fails because not enough
> free RAM, reclaim is invoked, compaction succeeds, THP is allocated to
> your lib user, compaction fails because not enough free RAM, reclaim
> is invoked etc.. compact_result is not COMPACT_DEFERRED, but
> COMPACT_SKIPPED.
> 
> See the part "reclaim is invoked" (with __GFP_THISNODE), is enough to
> still create the same heavy swap storm and unfairly penalize all apps
> with memory allocated in the local node like if your library had
> actually the kernel privilege to run mbind or mlock, which is not ok.
> 
> Only __GFP_COMPACT_ONLY truly can avoid reclaim, the moment reclaim
> can run with __GFP_THISNODE set, all bets are off and we're back to
> square one, no difference (at best marginal difference) with
> __GFP_NORETRY being set.
> 

The page allocator is expecting __GFP_NORETRY for thp allocations per its 
comment:

		/*
		 * Checks for costly allocations with __GFP_NORETRY, which
		 * includes THP page fault allocations
		 */
		if (costly_order && (gfp_mask & __GFP_NORETRY)) {

And that enables us to check compact_result to determine whether thp 
allocation should fail or continue to reclaim.  I don't think it helps 
that some thp fault allocations use __GFP_NORETRY and others do not.  I 
think that deserves a fix to alloc_hugepage_direct_gfpmask() or 
GFP_TRANSHUGE_LIGHT.

Our library that uses MADV_HUGEPAGE only invokes direct compaction because 
we're on an older kernel: it does not attempt to do reclaim to make 
compaction happy so that it finds memory that it can migrate memory to.  
For reference, we use defrag setting of "defer+madvise".  Everybody who 
does not use MADV_HUGEPAGE kicks off kswapd/kcompactd and fails, 
MADV_HUGEPAGE users do the same but also try direct compaction.  That 
direct compaction uses a mode of MIGRATE_ASYNC so it normally fails 
because of need_resched() or spinlock contention.

These allocations always fail, MADV_HUGEPAGE or otherwise, without 
invoking direct reclaim.

I am agreeing with both you and Mel that it makes no sense to thrash the 
local node to make compaction happy and then hugepage-order memory 
available.  I'm only differing with you on the mechanism to fail early: we 
never want to do attempt reclaim on thp allocations specifically because 
it leads to the behavior you are addressing.

My contention is that removing __GFP_THISNODE papers over the problem, 
especially in cases where remote memory is also fragmnented. It incurs a 
much higher (40%) fault latency and then incurs 13.9% greater access 
latency.  It is not a good result, at least for Haswell, Naples, and Rome.

To make a case that we should fault hugepages remotely as fallback, either 
for non-MADV_HUGEPAGE users who do not use direct compaction, or 
MADV_HUGEPAGE users who use direct compaction, we need numbers that 
suggest there is a performance benefit in terms of access latency to 
suggest that it is better; this is especially the case when the fault 
latency is 40% higher.  On Haswell, Naples, and Rome, it is quite obvious 
that this patch works much harder to fault memory remotely that incurs a 
substantial performance penalty when it fails and in the cases where it 
succeeds we have much worse access latency.

For users who bind their applications to a subset of cores on a NUMA node, 
fallback is egregious: we are guaranteed to never have local access 
latency and in the case when the local node is fragmented and remote 
memory is not our fault latency goes through the roof when local native 
pages would have been much better.

I've brought numbers on how poor this patch performs, so I'm asking for a 
rebuttal that suggests it is better on some platforms.  (1) On what 
platforms is it better to fault remote hugepages over local native pages?  
(2) What is the fault latency increase when remote nodes are fragmented as 
well?

> Like Mel said, your app just happens to fit in a local node, if the
> user of the lib is slightly different and allocates 16G on a system
> where each node is 4G, the post-fix MADV_HUGEPAGE will perform
> extremely better also for the lib user.
> 

It won't if the remote memory is fragmented; the patch is premised on the 
argument that remote memory is never fragmented or under memory pressure 
otherwise it is multiplying the fault latency by the number of nodes.  
Sure, creating test cases where the local node is under heavy memory 
pressure yet remote nodes are mostly free will minimize the impact this 
has on fault latency.  It still is a substantial increase, as I've 
measured on Haswell, but the access latency forever is the penalty.  This 
patch cannot make access to remote memory faster.

> And you know, if the lib user fits in one node, it can use mbind and
> it won't hit OOM... and you'd need some capability giving the app
> privilege anyway to keep MADV_HUGEPAGE as deep and unfair to the rest
> of the processes running the local node (like mbind and mlock require
> too).
> 

No, the lib user does not fit into a single node, we have cases where 
these nodes are under constant memory pressure.  We are on an older 
kernel, however, that fails because of need_resched() and spinlock 
contention rather than falling back to local reclaim.

> Did you try the __GFP_COMPACT_ONLY patch? That won't have the 40%
> fault latency already.
> 

I remember Kirill saying that he preferred node local memory allocation 
over thp allocation in the review, which I agree with, but it was much 
better than this patch.  I see no reason why we should work so hard to 
reclaim memory, thrash local memory, and swap so that the compaction 
freeing scanner can find memory when there is *no* guarantee that the 
migration scanner can free an entire pageblock.  That is completely 
pointless work, even for an MADV_HUGEPAGE user, and we certainly don't 
have such pathological behavior on older kernels.

> Also you're underestimating the benefit of THP given from remote nodes
> for virt a bit, the 40% fault latency is not an issue when the
> allocation is long lived, which is what MADV_HUGEPAGE is telling the
> kernel, and the benefit of THP for guest is multiplied. It's more a
> feature than a bug that 40% fault latency with MADV_HUGEPAGE set at
> least for all long lived allocations (but if the allocations aren't
> long lived, why should MADV_HUGEPAGE have been set in the first place?).
> 

For the long-term benefit of thp to outweigh the 40% fault latency 
increase, it would require that the access latency is better on remote 
memory.  It's worse, it's 13.9% worse access latency on Haswell.  It's a 
negative result for both allocation and access latency.

> With the fix applied, if you want to add __GFP_THISNODE to compaction
> only by default you still can, that solves the 40% fault latency just
> like __GFP_COMPACT_ONLY patch would also have avoided that 40%
> increased fault latency.
> 
> The issue here is very simple: you can't use __GFP_THISNODE for an
> allocation that can invoke reclaim, unless you have mlock or mbind
> higher privilege capabilities.
> 

Completely agreed, and I'm very strongly suggesting that we do not invoke 
reclaim.  Memory compaction provides no guarantee that it can free an 
entire pageblock, even for MIGRATE_ASYNC compaction under MADV_HUGEPAGE it 
will attempt to migrate SWAP_CLUSTER_MAX pages from a pageblock without 
considering if the entire pageblock could eventually become freed for 
order-9 memory.  The vast majority of the time this reclaim could be 
completely pointless.  We cannot use it.

> I'm still unconvinced about the __GFP_NORETRY arguments because I
> tried it already. However if you send a patch that fixes everything by
> only adding __GFP_NORETRY in every place you wish, I'd be glad to test
> it and verify if it actually solves the problem. Also note:
> 
> +#define __GFP_ONLY_COMPACT     ((__force gfp_t)(___GFP_NORETRY | \
> +                                                ___GFP_ONLY_COMPACT))
> 
> If ___GFP_NORETRY would have been enough __GFP_ONLY_COMPACT would have
> defined to ___GFP_NORETRY alone. When I figured it wasn't enough I
> added ___GFP_ONLY_COMPACT to retain the __GFP_THISNODE in the only
> place it's safe to retain it (i.e. compaction).
> 

We don't need __GFP_NORETRY anymore with this, it wouldn't be useful to 
continually compact memory in isolation if it already has failed.  It 
seems strange to be using __GFP_DIRECT_RECLAIM | __GFP_ONLY_COMPACT, 
though, as I assume this would evolve into.

Are there any other proposed users for __GFP_ONLY_COMPACT beyond thp 
allocations?  If not, we should just save the gfp bit and encode the logic 
directly into the page allocator.

Would you support this?
---
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -860,7 +860,7 @@ static int ttm_get_pages(struct page **pages, unsigned npages, int flags,
 			while (npages >= HPAGE_PMD_NR) {
 				gfp_t huge_flags = gfp_flags;
 
-				huge_flags |= GFP_TRANSHUGE_LIGHT | __GFP_NORETRY |
+				huge_flags |= GFP_TRANSHUGE_LIGHT |
 					__GFP_KSWAPD_RECLAIM;
 				huge_flags &= ~__GFP_MOVABLE;
 				huge_flags &= ~__GFP_COMP;
@@ -978,13 +978,13 @@ int ttm_page_alloc_init(struct ttm_mem_global *glob, unsigned max_pages)
 				  GFP_USER | GFP_DMA32, "uc dma", 0);
 
 	ttm_page_pool_init_locked(&_manager->wc_pool_huge,
-				  (GFP_TRANSHUGE_LIGHT | __GFP_NORETRY |
+				  (GFP_TRANSHUGE_LIGHT |
 				   __GFP_KSWAPD_RECLAIM) &
 				  ~(__GFP_MOVABLE | __GFP_COMP),
 				  "wc huge", order);
 
 	ttm_page_pool_init_locked(&_manager->uc_pool_huge,
-				  (GFP_TRANSHUGE_LIGHT | __GFP_NORETRY |
+				  (GFP_TRANSHUGE_LIGHT |
 				   __GFP_KSWAPD_RECLAIM) &
 				  ~(__GFP_MOVABLE | __GFP_COMP)
 				  , "uc huge", order);
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -298,7 +298,8 @@ struct vm_area_struct;
 #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
 #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
 #define GFP_TRANSHUGE_LIGHT	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
-			 __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
+			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
+			~__GFP_RECLAIM)
 #define GFP_TRANSHUGE	(GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
 
 /* Convert GFP flags to their corresponding migrate type */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -628,13 +628,15 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
  * madvise: directly stall for MADV_HUGEPAGE, otherwise fail if not immediately
  *	    available
  * never: never stall for any thp allocation
+ *
+ * "Stalling" here implies direct memory compaction but not direct reclaim.
  */
 static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+		return GFP_TRANSHUGE;
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4145,6 +4145,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 			if (compact_result == COMPACT_DEFERRED)
 				goto nopage;
 
+			/*
+			 * If faulting a hugepage, it is very unlikely that
+			 * thrashing the zonelist is going to assist compaction
+			 * in freeing an entire pageblock.  There are no
+			 * guarantees memory compaction can free an entire
+			 * pageblock under such memory pressure that it is
+			 * better to simply fail and fallback to native pages.
+			 */
+			if (order == pageblock_order &&
+					!(current->flags & PF_KTHREAD))
+				goto nopage;
+
 			/*
 			 * Looks like reclaim/compaction is worth trying, but
 			 * sync compaction could be very expensive, so keep
Mel Gorman Oct. 9, 2018, 9:48 a.m. UTC | #11
On Mon, Oct 08, 2018 at 01:41:09PM -0700, David Rientjes wrote:
> On Fri, 5 Oct 2018, Andrea Arcangeli wrote:
> 
> > I tried to add just __GFP_NORETRY but it changes nothing. Try it
> > yourself if you think that can resolve the swap storm and excessive
> > reclaim CPU overhead... and see if it works. I didn't intend to
> > reinvent the wheel with __GFP_COMPACT_ONLY, if __GFP_NORETRY would
> > have worked. I tried adding __GFP_NORETRY first of course.
> > 
> > Reason why it doesn't help is: compaction fails because not enough
> > free RAM, reclaim is invoked, compaction succeeds, THP is allocated to
> > your lib user, compaction fails because not enough free RAM, reclaim
> > is invoked etc.. compact_result is not COMPACT_DEFERRED, but
> > COMPACT_SKIPPED.
> > 
> > See the part "reclaim is invoked" (with __GFP_THISNODE), is enough to
> > still create the same heavy swap storm and unfairly penalize all apps
> > with memory allocated in the local node like if your library had
> > actually the kernel privilege to run mbind or mlock, which is not ok.
> > 
> > Only __GFP_COMPACT_ONLY truly can avoid reclaim, the moment reclaim
> > can run with __GFP_THISNODE set, all bets are off and we're back to
> > square one, no difference (at best marginal difference) with
> > __GFP_NORETRY being set.
> > 
> 
> The page allocator is expecting __GFP_NORETRY for thp allocations per its 
> comment:
> 
> 		/*
> 		 * Checks for costly allocations with __GFP_NORETRY, which
> 		 * includes THP page fault allocations
> 		 */
> 		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
> 
> And that enables us to check compact_result to determine whether thp 
> allocation should fail or continue to reclaim.  I don't think it helps 
> that some thp fault allocations use __GFP_NORETRY and others do not.  I 
> think that deserves a fix to alloc_hugepage_direct_gfpmask() or 
> GFP_TRANSHUGE_LIGHT.
> 

I am concerned that we may be trying to deal with this in terms of right
and wrong when the range of workloads and their preferred semantics
force us into a grey area.

The THP user without __GFP_NORETRY is "always"

always
        means that an application requesting THP will stall on
        allocation failure and directly reclaim pages and compact
        memory in an effort to allocate a THP immediately. This may be
        desirable for virtual machines that benefit heavily from THP
        use and are willing to delay the VM start to utilise them.

Removing __GFP_NORETRY in this instance is due to the tuning hinting that
direct reclaim is prefectly acceptable. __GFP_NORETRY means that the kernel
will not compact memory after a recent failure but if the caller is willing
to reclaim memory, then it follows that retrying compaction is justified.

Is this the correct thing to do in 100% of cases? Probably not, there
will be corner cases but it also does not necessarily follow that
__GFP_NORETRY should be univeral.

What is missing here is an agreed upon set of reference test cases
that can be used for the basis of evaluating patches like this. They
should be somewhat representative of the target applications of virtual
memory initialisation (forget about runtime at the moment as that is
stacking problems), a simulator of the google workload and library and
my test case of simply referencing an amount of memory larger than one
node. That would cover the current discussion at least but more would be
needed later. Otherwise we're going to endlessly whack-a-mole fixing one
workload and hurting another. It might be overkill but otherwise this
discussion risks going in circles.

The previous reference cases were ones that focused on either THP
allocation success rates or the benefit of THP itself, neither of which
are particularly useful in the current context.

> Our library that uses MADV_HUGEPAGE only invokes direct compaction because 
> we're on an older kernel: it does not attempt to do reclaim to make 
> compaction happy so that it finds memory that it can migrate memory to.  
> For reference, we use defrag setting of "defer+madvise".  Everybody who 
> does not use MADV_HUGEPAGE kicks off kswapd/kcompactd and fails, 
> MADV_HUGEPAGE users do the same but also try direct compaction.  That 
> direct compaction uses a mode of MIGRATE_ASYNC so it normally fails 
> because of need_resched() or spinlock contention.
> 
> These allocations always fail, MADV_HUGEPAGE or otherwise, without 
> invoking direct reclaim.
> 
> I am agreeing with both you and Mel that it makes no sense to thrash the 
> local node to make compaction happy and then hugepage-order memory 
> available.  I'm only differing with you on the mechanism to fail early: we 
> never want to do attempt reclaim on thp allocations specifically because 
> it leads to the behavior you are addressing.
> 
> My contention is that removing __GFP_THISNODE papers over the problem, 
> especially in cases where remote memory is also fragmnented. It incurs a 
> much higher (40%) fault latency and then incurs 13.9% greater access 
> latency.  It is not a good result, at least for Haswell, Naples, and Rome.
> 

Ok, this is a good point. MADV_HUGEPAGE is hinting that huge pages
are desired and in recent kernels, that also meant they would be local
pages i.e. locality is more important than huge pages even if huge pages
are desirable. With this change, it indicates that huge pages are more
important than locality because that is what userspace hinted.

Which is better?

locality is more important if your workload fits in memory and the
	initialisation phase is not performance-critical. It would
	also be more important if your workload is larger than a
	node but the critical working set fits within a node while
	the other usages are like streaming readers and writers where
	the data is referenced once and can be safely reclaimed later.
huge pages is more important if your workload is virtualised and
	benefits heavily due to reduced TLB cost from EPT.
	Similarly it's better if the workload has high special locality,
	particularly if it is also fitting within a cache where the
	remote cost is masked.

These are simple examples and even then we cannot detect which case
applies in advance so it falls back to what does the hint mean? The name
suggests that huge pages are desirable and the locality is a hint. Granted,
if that means THP is going remote and incurring cost there, it might be
worse overall for some workloads, particularly if the system is fragmented.

This goes back to my point that the MADV_HUGEPAGE hint should not make
promises about locality and that introducing MADV_LOCAL for specialised
libraries may be more appropriate with the initial semantic being how it
treats MADV_HUGEPAGE regions.

> To make a case that we should fault hugepages remotely as fallback, either 
> for non-MADV_HUGEPAGE users who do not use direct compaction, or 
> MADV_HUGEPAGE users who use direct compaction, we need numbers that 
> suggest there is a performance benefit in terms of access latency to 
> suggest that it is better; this is especially the case when the fault 
> latency is 40% higher.  On Haswell, Naples, and Rome, it is quite obvious 
> that this patch works much harder to fault memory remotely that incurs a 
> substantial performance penalty when it fails and in the cases where it 
> succeeds we have much worse access latency.
> 
> For users who bind their applications to a subset of cores on a NUMA node, 
> fallback is egregious: we are guaranteed to never have local access 
> latency and in the case when the local node is fragmented and remote 
> memory is not our fault latency goes through the roof when local native 
> pages would have been much better.
> 
> I've brought numbers on how poor this patch performs, so I'm asking for a 
> rebuttal that suggests it is better on some platforms.  (1) On what 
> platforms is it better to fault remote hugepages over local native pages?  
> (2) What is the fault latency increase when remote nodes are fragmented as 
> well?
> 

On the flip side, we've also heard how some other applications are
adversely affected such as virtual machine start-ups.

What you're asking for asking is an astonishing amount of work --
a multi platform study against a wide variety of workloads with the
addition that some test should be able to start in a fragmented state
that is reproducible.

What is being asked of you is to consider introducing a new madvise hint and
having MADV_HUGEPAGE being about huge pages and introducing a new hint that
is hinting about locality without the strictness of memory binding. That
is significantly more tractable and you also presumably have access to
a reasonable reproduction cases even though I doubt you can release it.

> > Like Mel said, your app just happens to fit in a local node, if the
> > user of the lib is slightly different and allocates 16G on a system
> > where each node is 4G, the post-fix MADV_HUGEPAGE will perform
> > extremely better also for the lib user.
> > 
> 
> It won't if the remote memory is fragmented; the patch is premised on the 
> argument that remote memory is never fragmented or under memory pressure 
> otherwise it is multiplying the fault latency by the number of nodes.  
> Sure, creating test cases where the local node is under heavy memory 
> pressure yet remote nodes are mostly free will minimize the impact this 
> has on fault latency.  It still is a substantial increase, as I've 
> measured on Haswell, but the access latency forever is the penalty.  This 
> patch cannot make access to remote memory faster.
> 

Indeed not but remote costs are highly platform dependant so decisions
made for haswell are not necessarily justified on EPYC or any later Intel
processor either. Because these goalposts will keep changing, I think
it's better to have more clearly defined madvise hints and rely less on
implementation details that are subject to change. Otherwise we will get
stuck in an endless debate about "is workload X more important than Y"?

No doubt this patch will have problems but it treats MADV_HUGEPAGE as a
more stricter hint that huge pages are desirable.

> > <SNIP>
> >
> > I'm still unconvinced about the __GFP_NORETRY arguments because I
> > tried it already. However if you send a patch that fixes everything by
> > only adding __GFP_NORETRY in every place you wish, I'd be glad to test
> > it and verify if it actually solves the problem. Also note:
> > 
> > +#define __GFP_ONLY_COMPACT     ((__force gfp_t)(___GFP_NORETRY | \
> > +                                                ___GFP_ONLY_COMPACT))
> > 
> > If ___GFP_NORETRY would have been enough __GFP_ONLY_COMPACT would have
> > defined to ___GFP_NORETRY alone. When I figured it wasn't enough I
> > added ___GFP_ONLY_COMPACT to retain the __GFP_THISNODE in the only
> > place it's safe to retain it (i.e. compaction).
> > 
> 
> We don't need __GFP_NORETRY anymore with this, it wouldn't be useful to 
> continually compact memory in isolation if it already has failed.  It 
> seems strange to be using __GFP_DIRECT_RECLAIM | __GFP_ONLY_COMPACT, 
> though, as I assume this would evolve into.
> 
> Are there any other proposed users for __GFP_ONLY_COMPACT beyond thp 
> allocations?  If not, we should just save the gfp bit and encode the logic 
> directly into the page allocator.
> 
> Would you support this?

I don't think it's necessarily bad but it cannot distinguish between
THP and hugetlbfs. Hugetlbfs users are typically more willing to accept
high overheads as they may be required for the application to function.
That's probably fixable but will still leave us in the state where
MADV_HUGEPAGE is also a hint about locality. It'd still be interesting
to hear if it fixes the VM initialisation issue but do note that if this
patch is used as a replacement that hugetlbfs users may complain down
the line.
Michal Hocko Oct. 9, 2018, 12:27 p.m. UTC | #12
[Sorry for being slow in responding but I was mostly offline last few
 days]

On Tue 09-10-18 10:48:25, Mel Gorman wrote:
[...]
> This goes back to my point that the MADV_HUGEPAGE hint should not make
> promises about locality and that introducing MADV_LOCAL for specialised
> libraries may be more appropriate with the initial semantic being how it
> treats MADV_HUGEPAGE regions.

I agree with your other points and not going to repeat them. I am not
sure madvise s the best API for the purpose though. We are talking about
memory policy here and there is an existing api for that so I would
_prefer_ to reuse it for this purpose.

Sure we will likely need somethin in the compaction as well but we
should start simple and go forward in smaller steps.
Mel Gorman Oct. 9, 2018, 1 p.m. UTC | #13
On Tue, Oct 09, 2018 at 02:27:45PM +0200, Michal Hocko wrote:
> [Sorry for being slow in responding but I was mostly offline last few
>  days]
> 
> On Tue 09-10-18 10:48:25, Mel Gorman wrote:
> [...]
> > This goes back to my point that the MADV_HUGEPAGE hint should not make
> > promises about locality and that introducing MADV_LOCAL for specialised
> > libraries may be more appropriate with the initial semantic being how it
> > treats MADV_HUGEPAGE regions.
> 
> I agree with your other points and not going to repeat them. I am not
> sure madvise s the best API for the purpose though. We are talking about
> memory policy here and there is an existing api for that so I would
> _prefer_ to reuse it for this purpose.
> 

I flip-flopped on that one in my head multiple times on the basis of
how strict it should be. Memory policies tend to be black or white --
bind here, interleave there, etc. It wasn't clear to me what the best
policy would be to describe "allocate local as best as you can but allow
fallbacks if necessary". Hence, I started leaning towards advise as it is
really about advice that the kernel can ignore if necessary. That said,
I don't feel as strongly about the "how" as I do about the fact that
applications and libraries should not depend on side-effects of the
MADV_HUGEPAGE implementation that relate to locality.
Vlastimil Babka Oct. 9, 2018, 1:08 p.m. UTC | #14
On 10/8/18 10:41 PM, David Rientjes wrote:
> +			/*
> +			 * If faulting a hugepage, it is very unlikely that
> +			 * thrashing the zonelist is going to assist compaction
> +			 * in freeing an entire pageblock.  There are no
> +			 * guarantees memory compaction can free an entire
> +			 * pageblock under such memory pressure that it is
> +			 * better to simply fail and fallback to native pages.
> +			 */
> +			if (order == pageblock_order &&
> +					!(current->flags & PF_KTHREAD))
> +				goto nopage;

After we got rid of similar hardcoded heuristics, I would be very
unhappy to start adding them back. A new gfp flag is also unfortunate,
but more acceptable to me.

> +
>  			/*
>  			 * Looks like reclaim/compaction is worth trying, but
>  			 * sync compaction could be very expensive, so keep
>
Michal Hocko Oct. 9, 2018, 2:25 p.m. UTC | #15
On Tue 09-10-18 14:00:34, Mel Gorman wrote:
> On Tue, Oct 09, 2018 at 02:27:45PM +0200, Michal Hocko wrote:
> > [Sorry for being slow in responding but I was mostly offline last few
> >  days]
> > 
> > On Tue 09-10-18 10:48:25, Mel Gorman wrote:
> > [...]
> > > This goes back to my point that the MADV_HUGEPAGE hint should not make
> > > promises about locality and that introducing MADV_LOCAL for specialised
> > > libraries may be more appropriate with the initial semantic being how it
> > > treats MADV_HUGEPAGE regions.
> > 
> > I agree with your other points and not going to repeat them. I am not
> > sure madvise s the best API for the purpose though. We are talking about
> > memory policy here and there is an existing api for that so I would
> > _prefer_ to reuse it for this purpose.
> > 
> 
> I flip-flopped on that one in my head multiple times on the basis of
> how strict it should be. Memory policies tend to be black or white --
> bind here, interleave there, etc. It wasn't clear to me what the best
> policy would be to describe "allocate local as best as you can but allow
> fallbacks if necessary".

I was thinking about MPOL_NODE_PROXIMITY with the following semantic:
- try hard to allocate from a local or very close numa node(s) even when
that requires expensive operations like the memory reclaim/compaction
before falling back to other more distant numa nodes.


> Hence, I started leaning towards advise as it is
> really about advice that the kernel can ignore if necessary. That said,
> I don't feel as strongly about the "how" as I do about the fact that
> applications and libraries should not depend on side-effects of the
> MADV_HUGEPAGE implementation that relate to locality.

completely agreed on this.
Mel Gorman Oct. 9, 2018, 3:16 p.m. UTC | #16
On Tue, Oct 09, 2018 at 04:25:10PM +0200, Michal Hocko wrote:
> On Tue 09-10-18 14:00:34, Mel Gorman wrote:
> > On Tue, Oct 09, 2018 at 02:27:45PM +0200, Michal Hocko wrote:
> > > [Sorry for being slow in responding but I was mostly offline last few
> > >  days]
> > > 
> > > On Tue 09-10-18 10:48:25, Mel Gorman wrote:
> > > [...]
> > > > This goes back to my point that the MADV_HUGEPAGE hint should not make
> > > > promises about locality and that introducing MADV_LOCAL for specialised
> > > > libraries may be more appropriate with the initial semantic being how it
> > > > treats MADV_HUGEPAGE regions.
> > > 
> > > I agree with your other points and not going to repeat them. I am not
> > > sure madvise s the best API for the purpose though. We are talking about
> > > memory policy here and there is an existing api for that so I would
> > > _prefer_ to reuse it for this purpose.
> > > 
> > 
> > I flip-flopped on that one in my head multiple times on the basis of
> > how strict it should be. Memory policies tend to be black or white --
> > bind here, interleave there, etc. It wasn't clear to me what the best
> > policy would be to describe "allocate local as best as you can but allow
> > fallbacks if necessary".
> 
> I was thinking about MPOL_NODE_PROXIMITY with the following semantic:
> - try hard to allocate from a local or very close numa node(s) even when
> that requires expensive operations like the memory reclaim/compaction
> before falling back to other more distant numa nodes.
> 

Seems reasonable. It's not far from the general semantics I thought
MADV_LOCAL would have.
David Rientjes Oct. 9, 2018, 10:17 p.m. UTC | #17
On Tue, 9 Oct 2018, Mel Gorman wrote:

> > The page allocator is expecting __GFP_NORETRY for thp allocations per its 
> > comment:
> > 
> > 		/*
> > 		 * Checks for costly allocations with __GFP_NORETRY, which
> > 		 * includes THP page fault allocations
> > 		 */
> > 		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
> > 
> > And that enables us to check compact_result to determine whether thp 
> > allocation should fail or continue to reclaim.  I don't think it helps 
> > that some thp fault allocations use __GFP_NORETRY and others do not.  I 
> > think that deserves a fix to alloc_hugepage_direct_gfpmask() or 
> > GFP_TRANSHUGE_LIGHT.
> > 
> 
> I am concerned that we may be trying to deal with this in terms of right
> and wrong when the range of workloads and their preferred semantics
> force us into a grey area.
> 
> The THP user without __GFP_NORETRY is "always"
> 
> always
>         means that an application requesting THP will stall on
>         allocation failure and directly reclaim pages and compact
>         memory in an effort to allocate a THP immediately. This may be
>         desirable for virtual machines that benefit heavily from THP
>         use and are willing to delay the VM start to utilise them.
> 

I'm not sure what you mean when you say the thp user without __GFP_NORETRY 
is "always".  If you're referring to the defrag mode "always", 
__GFP_NORETRY is only possible for non-madvised memory regions.  All other 
defrag modes exclude it, and I argue they incorrectly exclude it.

I believe the userspace expectation, whether it is described well or not, 
is that defrag mode "always" simply gets the same behavior for defrag mode 
"madvise" but for everybody, not just MADV_HUGEPAGE regions.  Then, 
"defer+madvise" triggers kswapd/kcompactd but compacts for MADV_HUGEPAGE, 
"defer" triggers kswapd/compactd and fails, and "never" fails.  This, to 
me, seems like the most sane heuristic.

Whether reclaim is helpful or not is what I'm questioning, I don't think 
that it benefits any user to have strict adherence to a documentation that 
causes workloads to severely regress both in fault and access latency when 
we know that direct reclaim is unlikely to make direct compaction free an 
entire pageblock.  It's more likely than not that the reclaim was 
pointless and the allocation will still fail.

If memory compaction were patched such that it can report that it could 
successfully free a page of the specified order if there were free pages 
at the end of the zone it could migrate to, reclaim might be helpful.  But 
with the current implementation, I don't think that is reliably possible.  
These free pages could easily be skipped over by the migration scanner 
because of the presence of slab pages, for example, and unavailable to the 
freeing scanner.

> Removing __GFP_NORETRY in this instance is due to the tuning hinting that
> direct reclaim is prefectly acceptable. __GFP_NORETRY means that the kernel
> will not compact memory after a recent failure but if the caller is willing
> to reclaim memory, then it follows that retrying compaction is justified.
> 
> Is this the correct thing to do in 100% of cases? Probably not, there
> will be corner cases but it also does not necessarily follow that
> __GFP_NORETRY should be univeral.
> 

For reclaim to be useful, I believe we need a major rewrite of memory 
compaction such that the freeing scanner is eliminated and it can benefit 
from memory passed over by the migration scanner.  In this case, it would 
require the freeing scanner to do the actual reclaim itself.  I don't 
believe it is beneficial to any user that we reclaim memory from the zone 
when we don't know (1) that the migration scanner will eventually be able 
to free an entire pageblock, (2) the reclaimed memory can be accessed by 
the freeing scanner as a migration target, and (3) all this synchronous 
work was not done on behalf of a user only to lose the hugepage to a 
concurrent allocator that is not even madvised.

Since we lack all of this in the allocator/compaction/reclaim, it seems 
that the painful fault latency here can be corrected by doing what is 
actually useful, memory compaction, and rely on its heuristics of when to 
give up and when to proceed rather than thrash the zone.  The insane fault 
latency being reported here is certainly not what the user is asking for 
when it does MADV_HUGEPAGE, and removing __GFP_THISNODE doesn't help in 
any way if remote memory is fragmented or low on memory as well.

> What is missing here is an agreed upon set of reference test cases
> that can be used for the basis of evaluating patches like this. They
> should be somewhat representative of the target applications of virtual
> memory initialisation (forget about runtime at the moment as that is
> stacking problems), a simulator of the google workload and library and
> my test case of simply referencing an amount of memory larger than one
> node. That would cover the current discussion at least but more would be
> needed later. Otherwise we're going to endlessly whack-a-mole fixing one
> workload and hurting another. It might be overkill but otherwise this
> discussion risks going in circles.
> 

Considering only fault latency, it's relatively trivial to fragment all 
zones and then measure the time it takes to start your workload.  This is 
where the 40% regression I reported came from.  Having tons of memory free 
remotely yet having the local node under such strenuous memory pressure 
that compaction cannot even find free pages isn't a healthy balance.  
Perhaps we do not fault remote memory as easily because there is not an 
imbalance.

My criticism is obviously based on the fault and access latency 
regressions that this introduces but is also focused on what makes sense 
for the current implementation of the page allocator, memory compaction, 
and direct reclaim.  I know that direct reclaim is these contexts will 
very, very seldom help memory compaction because it requires strenuous 
memory pressure (otherwise compaction would have tried and failed to 
defrag memory for other reasons) and under those situations we have data 
that shows slab fragmentation is much more likely to occur such that 
compaction will never be successful.  Situations where a node has 1.5GB of 
slab yet 100GB of pageblocks have 1+ slab pages because the allocator 
prefers to return node local memory at the cost of fragmenting 
MIGRATE_MOVABLE pageblocks.

I really think that for this patch to be merged over my proposed change 
that it needs to be clearly demonstrated that reclaim was successful in 
that it freed memory that was subsequently available to the compaction 
freeing scanner and that enabled entire pageblocks to become free.  That, 
in my experience, will very very seldom be successful because of internal 
slab fragmentation, compaction_alloc() cannot soak up pages from the 
reclaimed memory, and potentially thrash the zone completely pointlessly.  
The last point is the problem being reported here, but the other two are 
as legitimate.

> Which is better?
> 
> locality is more important if your workload fits in memory and the
> 	initialisation phase is not performance-critical. It would
> 	also be more important if your workload is larger than a
> 	node but the critical working set fits within a node while
> 	the other usages are like streaming readers and writers where
> 	the data is referenced once and can be safely reclaimed later.
> huge pages is more important if your workload is virtualised and
> 	benefits heavily due to reduced TLB cost from EPT.
> 	Similarly it's better if the workload has high special locality,
> 	particularly if it is also fitting within a cache where the
> 	remote cost is masked.
> 
> These are simple examples and even then we cannot detect which case
> applies in advance so it falls back to what does the hint mean? The name
> suggests that huge pages are desirable and the locality is a hint. Granted,
> if that means THP is going remote and incurring cost there, it might be
> worse overall for some workloads, particularly if the system is fragmented.
> 
> This goes back to my point that the MADV_HUGEPAGE hint should not make
> promises about locality and that introducing MADV_LOCAL for specialised
> libraries may be more appropriate with the initial semantic being how it
> treats MADV_HUGEPAGE regions.
> 

Forget locality, the patch incurs a 40% fault regression when remote 
memory is fragmented as well.  Thrashing the local node will begin to 
thrash the remote nodes as well if they are balanced and fragmented in the 
same way.  This cannot possibly be in the best interest of the user.  What 
is in the best interest of the user is what can actually be gained by 
incurring that latency.

We talked about a "privileged" madvise mode that is even stronger with 
regard to locality before, which I don't think is needed.  If you have 
the privilege you could always use drop_caches, explicitly trigger 
MIGRATE_SYNC compaction yourself, speedup khugepaged in that background 
temporarily, etc. 

You could build upon my patch and say that compaction is beneficial, but 
we can still remove __GFP_THISNODE to get remote hugepages.  But let us 
please focus on the role that reclaim has when allocating a hugepage 
coupled with the implementation of the page allocator.

When a user reports this 40% fault latency and 13.9% access latency 
regressions if this patch is merged, what is the suggested fix?  Wait for 
another madvise mode and then patch your binary, assuming you can modify 
the source, and then ask for a privilege so it can stress reclaim locally?

> What you're asking for asking is an astonishing amount of work --
> a multi platform study against a wide variety of workloads with the
> addition that some test should be able to start in a fragmented state
> that is reproducible.
> 

No, I'm asking that you show in the implementation of the page allocator, 
memory compaction, and direct reclaim why the work done by reclaim (the 
zone thrashing and swapping) is proven beneficial that it leads to the 
allocation of a hugepage.  I'm not a fan of thrashing a zone pointlessly 
because we think the compaction freeing scanner may be able to allocate 
and the migration scanner may then suddenly free an entire pageblock.  If 
that feedback loop can be provided, reclaim sounds worthwhile.  Until 
then, we can't do such pointless work.

> What is being asked of you is to consider introducing a new madvise hint and
> having MADV_HUGEPAGE being about huge pages and introducing a new hint that
> is hinting about locality without the strictness of memory binding. That
> is significantly more tractable and you also presumably have access to
> a reasonable reproduction cases even though I doubt you can release it.
> 

We continue to talk about the role of __GFP_THISNODE.  The Andrea patch is 
obviously premised on the idea that for some reason remote memory is much 
less fragmented or not under memory pressure.  If the nodes are actually 
balanced, you've incurred a 40% fault latency absolutely and completely 
pointlessly.  What I'm focusing on is the role of reclaim, not 
__GFP_THISNODE.  Let's fix the issue being reported and look at the 
implementation of memory compaction to understand why it is happening: 
reclaim very very seldom can assist memory compaction in making a 
pageblock free under these conditions.  It's not beneficial and should not 
be done.

> > > Like Mel said, your app just happens to fit in a local node, if the
> > > user of the lib is slightly different and allocates 16G on a system
> > > where each node is 4G, the post-fix MADV_HUGEPAGE will perform
> > > extremely better also for the lib user.
> > > 
> > 
> > It won't if the remote memory is fragmented; the patch is premised on the 
> > argument that remote memory is never fragmented or under memory pressure 
> > otherwise it is multiplying the fault latency by the number of nodes.  
> > Sure, creating test cases where the local node is under heavy memory 
> > pressure yet remote nodes are mostly free will minimize the impact this 
> > has on fault latency.  It still is a substantial increase, as I've 
> > measured on Haswell, but the access latency forever is the penalty.  This 
> > patch cannot make access to remote memory faster.
> > 
> 
> Indeed not but remote costs are highly platform dependant so decisions
> made for haswell are not necessarily justified on EPYC or any later Intel
> processor either.

I'm reporting regressions for the Andrea patch on Haswell, Naples, and 
Rome.  If you have numbers for any platform that suggests remote huge 
memory (and of course there's no consideration of NUMA distance here, so 
2-hop would be allowed as well by this patch) is beneficial over local 
native pages, please present it.

> Because these goalposts will keep changing, I think
> it's better to have more clearly defined madvise hints and rely less on
> implementation details that are subject to change. Otherwise we will get
> stuck in an endless debate about "is workload X more important than Y"?
> 

The clearly defined madvise is not helpful if it causes massive thrashing 
of *any* zone on the system and cannot enable memory compaction to 
directly make a hugepage available.  In other words, show that I didn't 
just reclaim 60GB of memory that still causes compaction to fail to free a 
pageblock, which is precisely what happens today, either because slab 
fragmentation as the result of the memory pressure that we are under or 
because it is inaccessible to the freeing scanner.

> I don't think it's necessarily bad but it cannot distinguish between
> THP and hugetlbfs. Hugetlbfs users are typically more willing to accept
> high overheads as they may be required for the application to function.
> That's probably fixable but will still leave us in the state where
> MADV_HUGEPAGE is also a hint about locality. It'd still be interesting
> to hear if it fixes the VM initialisation issue but do note that if this
> patch is used as a replacement that hugetlbfs users may complain down
> the line.
> 

Hugetlb memory is allocated round-robin over the set of allowed nodes with 
__GFP_THISNODE so it will already result in this thrashing problem that is 
being reported with the very unlikely possibility that reclaim has helped 
us allocate a hugepage.

And this does distinguish between thp memory and hugetlb memory, thp is 
explicitly allocated here with __GFP_NORETRY, which controls the goto 
nopage, and hugetlb memory is allocated with __GFP_RETRY_MAYFAIL.
 
I'd hate to add a gfp flag unnecessarily for this since we're talking 
about the role of reclaim in assisting memory compaction to make a 
high-order page available.  That is clearly defined in my patch to be 
based on __GFP_NORETRY.

I'd appreciate if Andrea can test this patch, have a rebuttal that we 
should still remove __GFP_THISNODE because we don't care about locality as 
much as forming a hugepage, we can make that change, and then merge this 
instead of causing such massive fault and access latencies.
Andrea Arcangeli Oct. 9, 2018, 10:21 p.m. UTC | #18
On Mon, Oct 08, 2018 at 01:41:09PM -0700, David Rientjes wrote:
> The page allocator is expecting __GFP_NORETRY for thp allocations per its 
> comment:
> 
> 		/*
> 		 * Checks for costly allocations with __GFP_NORETRY, which
> 		 * includes THP page fault allocations
> 		 */
> 		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
> 
> And that enables us to check compact_result to determine whether thp 
> allocation should fail or continue to reclaim.  I don't think it helps 
> that some thp fault allocations use __GFP_NORETRY and others do not.  I 
> think that deserves a fix to alloc_hugepage_direct_gfpmask() or 
> GFP_TRANSHUGE_LIGHT.

With your patch that changes how __GFP_NORETRY works for order =
pageblock_order then this can help. However it solves nothing for
large skbs and all other "costly_order" allocations. So if you think
it's so bad to have a 40% increased allocation latency if all nodes
are heavily fragmented for THP under MADV_HUGEPAGE (which are
guaranteed to be long lived allocations or they wouldn't set
MADV_HUGEPAGE in the first place), I'm not sure why you ignore the
same exact overhead for all other "costly_order" allocations that
would never set __GFP_THISNODE and will not even use
GFP_TRANSHUGE_LIGHT so they may not set __GFP_NORETRY at all.

I think it would be better to view the problem from a generic
"costly_order" allocation prospective without so much "hardcoded"
focus on THP MADV_HUGEPAGE only (which in fact are the least concern
of all in terms of that 40% latency increase if all nodes are totally
fragmented and compaction fails on all nodes, because they're long
lived so the impact of the increased latency is more likely to be lost
in the noise).

> Our library that uses MADV_HUGEPAGE only invokes direct compaction because 
> we're on an older kernel: it does not attempt to do reclaim to make 
> compaction happy so that it finds memory that it can migrate memory to.  
> For reference, we use defrag setting of "defer+madvise".  Everybody who 
> does not use MADV_HUGEPAGE kicks off kswapd/kcompactd and fails, 
> MADV_HUGEPAGE users do the same but also try direct compaction.  That 
> direct compaction uses a mode of MIGRATE_ASYNC so it normally fails 
> because of need_resched() or spinlock contention.
> These allocations always fail, MADV_HUGEPAGE or otherwise, without 
> invoking direct reclaim.

Even older kernels (before "defer+madvise" option was available)
always invoked reclaim if COMPACT_SKIPPED was returned. The
COMPACT_SKIPPED behavior I referred that explains why __GFP_NORETRY
doesn't prevent reclaim to be invoked, is nothing new, it always
worked that way from the day compaction was introduced.

So it's fairly strange that your kernel doesn't call reclaim at all if
COMPACT_SKIPPED is returned.

> I am agreeing with both you and Mel that it makes no sense to thrash the 
> local node to make compaction happy and then hugepage-order memory 
> available.  I'm only differing with you on the mechanism to fail early: we 
> never want to do attempt reclaim on thp allocations specifically because 
> it leads to the behavior you are addressing.

Not sure I can agree with the above.

If all nodes (not only the local one) are already below the watermarks
and compaction returns COMPACT_SKIPPED, there is zero global free
memory available, we would need to swap anyway to succeed the 2M
allocation. So it's better to reclaim 2M from on node and then retry
compaction again on the same node if compaction is failing on the node
because of COMPACT_SKIPPED and the global free memory is zero. If it
swapouts it would have swapped out anyway but this way THP will be
returned.

It's just __GFP_THISNODE that broke the above logic. __GFP_THISNODE
may just be safe to use only if the global amount of free memory (in
all nodes) is zero (or below HPAGE_PMD_SIZE generally speaking).

> My contention is that removing __GFP_THISNODE papers over the problem, 
> especially in cases where remote memory is also fragmnented. It incurs a 
> much higher (40%) fault latency and then incurs 13.9% greater access 
> latency.  It is not a good result, at least for Haswell, Naples, and Rome.
> 
> To make a case that we should fault hugepages remotely as fallback, either 
> for non-MADV_HUGEPAGE users who do not use direct compaction, or 
> MADV_HUGEPAGE users who use direct compaction, we need numbers that 
> suggest there is a performance benefit in terms of access latency to 
> suggest that it is better; this is especially the case when the fault 
> latency is 40% higher.  On Haswell, Naples, and Rome, it is quite obvious 
> that this patch works much harder to fault memory remotely that incurs a 
> substantial performance penalty when it fails and in the cases where it 
> succeeds we have much worse access latency.

What we're fixing is a effectively a breakage and insecure behavior
too and that must be fixed first. A heavily multithreaded processes
with its memory not fitting in a single NUMA node is too common of a
workload to misbehave so bad as it did. The above issue is a much
smaller concern and it's not a breakage nor insecure.

__GFP_THISNODE semantics are not enough to express what you need
there.

Like Mel said we need two tests, one that represents the "pathological
THP allocation behavior" which is trivial (any heavy multithreaded app
accessing all memory of all nodes from all CPUs will do). And one that
represents your workload.

By thinking about it, it's almost impossible that eliminating 100% of
swapouts from the multithreaded app runtime, will not bring a benefit
that is orders of magnitude bigger than the elimination of remote
compaction and having more local THP pages for your app.

It's just on a different scale and the old behavior was insecure to
begin with because it allowed a process to heavily penalize the
runtime of other apps running with different uid, without requiring
any privilege.

> For users who bind their applications to a subset of cores on a NUMA node, 
> fallback is egregious: we are guaranteed to never have local access 
> latency and in the case when the local node is fragmented and remote 
> memory is not our fault latency goes through the roof when local native 
> pages would have been much better.
> 
> I've brought numbers on how poor this patch performs, so I'm asking for a 
> rebuttal that suggests it is better on some platforms.  (1) On what 
> platforms is it better to fault remote hugepages over local native pages?  
> (2) What is the fault latency increase when remote nodes are fragmented as 
> well?

That I don't think is the primary issue, the issue here is the one of
the previous paragrah.

Then there may be workloads and/or platforms where allocating remote
THP instead of local PAGE_SIZEd memory is better regardless and
__GFP_THISNODE hurted twice (but we could ignore those for now, and
just focus where __GFP_THISNODE provided a benefit like for your app
using the lib).

> > Like Mel said, your app just happens to fit in a local node, if the
> > user of the lib is slightly different and allocates 16G on a system
> > where each node is 4G, the post-fix MADV_HUGEPAGE will perform
> > extremely better also for the lib user.
> > 
> 
> It won't if the remote memory is fragmented; the patch is premised on the 

How could compaction run slower than heavy swapping gigabytes of
memory?

> argument that remote memory is never fragmented or under memory pressure 
> otherwise it is multiplying the fault latency by the number of nodes.  
> Sure, creating test cases where the local node is under heavy memory 
> pressure yet remote nodes are mostly free will minimize the impact this 
> has on fault latency.  It still is a substantial increase, as I've 
> measured on Haswell, but the access latency forever is the penalty.  This 
> patch cannot make access to remote memory faster.

I still doubt you measured the performance of the app using the lib
when the app requires more 4 times the size of the local NUMA node
under "taskset -c 0". If you do that, the app using the lib shall fall
into the "pathological THP allocation behavior" too. As long as the
spillover is 10-20% perhaps it's not as bad and if it's not heavily
multithreaded doing scattered random access like a virtual machine
will do, NUMA balancing will hide the issue by moving the thread to
CPU of the NUMA node where the 10-20% spillover happened (which is why
I asked "taskset -c 0" to simulate a multithreaded app accessing all
memory in a scattered way without locality.

The multithreaded app with scattered access is not unreasonable
workload and it must perform well: all workloads using MPOL_INTERLEAVE
would fit the criteria, not just qemu.

> > And you know, if the lib user fits in one node, it can use mbind and
> > it won't hit OOM... and you'd need some capability giving the app
> > privilege anyway to keep MADV_HUGEPAGE as deep and unfair to the rest
> > of the processes running the local node (like mbind and mlock require
> > too).
> > 
> 
> No, the lib user does not fit into a single node, we have cases where 
> these nodes are under constant memory pressure.  We are on an older 
> kernel, however, that fails because of need_resched() and spinlock 
> contention rather than falling back to local reclaim.

It's unclear which need_resched() and spinlock contention induces this
failure.

> > Did you try the __GFP_COMPACT_ONLY patch? That won't have the 40%
> > fault latency already.
> > 
> 
> I remember Kirill saying that he preferred node local memory allocation 
> over thp allocation in the review, which I agree with, but it was much 
> better than this patch.  I see no reason why we should work so hard to 
> reclaim memory, thrash local memory, and swap so that the compaction 
> freeing scanner can find memory when there is *no* guarantee that the 
> migration scanner can free an entire pageblock.  That is completely 
> pointless work, even for an MADV_HUGEPAGE user, and we certainly don't 
> have such pathological behavior on older kernels.

The compaction scanner is succeeding at migrating the entire
pageblock every time. The local node isn't so fragmented, it's just
full of anonymous memory that is getting swapped out at every further
invocation of reclaim. If it was fragmented __GFP_NORETRY would have
had an effect because compaction would have been deferred.

> Completely agreed, and I'm very strongly suggesting that we do not invoke 
> reclaim.  Memory compaction provides no guarantee that it can free an 
> entire pageblock, even for MIGRATE_ASYNC compaction under MADV_HUGEPAGE it 
> will attempt to migrate SWAP_CLUSTER_MAX pages from a pageblock without 
> considering if the entire pageblock could eventually become freed for 
> order-9 memory.  The vast majority of the time this reclaim could be 
> completely pointless.  We cannot use it.

If we don't invoke reclaim __GFP_THISNODE is fine to keep, however
that means direction compaction doesn't work at all if all nodes are
full of filesystem cache. This was a drawback of __GFP_COMPACT_ONLY
too. A minor one compared to the corner case created by
__GFP_THISNODE.

> We don't need __GFP_NORETRY anymore with this, it wouldn't be useful to 
> continually compact memory in isolation if it already has failed.  It 
> seems strange to be using __GFP_DIRECT_RECLAIM | __GFP_ONLY_COMPACT, 
> though, as I assume this would evolve into.
> 
> Are there any other proposed users for __GFP_ONLY_COMPACT beyond thp 
> allocations?  If not, we should just save the gfp bit and encode the logic 
> directly into the page allocator.

Yes that's seems equivalent, just it embeds __GFP_COMPACT_ONLY in
__GFP_NORETRY and ends up setting __GFP_NORETRY for everything,
__GFP_COMPACT_ONLY didn't alter the __GFP_NORETRY semantics instead.

> Would you support this?

I didn't consider it a problem the 40% increased allocation latency, I
thought __GFP_COMPACT_ONLY or the other fix in -mm was just about the
remote THP vs local PAGE_SIZEd memory tradeoff (considering this logic
was already hardcoded only for THP so I tried to keep it).

Now that you raise this higher compaction latency as a big regression
in not fixing the VM with __GFP_COMPACT_ONLY semantics in
__GFP_NORETRY, the next question is why you don't care at all about
such increased latency for all other "costly_order" allocations that
won't use __GFP_THISNODE and where your altered __GFP_NORETRY won't
help to skip compaction on the remote nodes.

In other words if such compaction invocation in fully fragmented
remote nodes is such a big concern for THP, why is not an issue at all
for all other "costly_order" allocations?

> +			if (order == pageblock_order &&
> +					!(current->flags & PF_KTHREAD))
> +				goto nopage;

For a more reliable THP hardcoding HPAGE_PMD_ORDER would be better
than pageblock_order (I guess especially for non-x86 archs where the
match between hugetlbfs size and thp size is less obvious), but then
the hardcoding itself now looks the worst part of this patch if the
concern is the 40% increased allocation latency (and not only the
remote compound page vs local 4k page).

Thanks,
Andrea
Andrea Arcangeli Oct. 9, 2018, 10:51 p.m. UTC | #19
On Tue, Oct 09, 2018 at 03:17:30PM -0700, David Rientjes wrote:
> causes workloads to severely regress both in fault and access latency when 
> we know that direct reclaim is unlikely to make direct compaction free an 
> entire pageblock.  It's more likely than not that the reclaim was 
> pointless and the allocation will still fail.

How do you know that? If all RAM is full of filesystem cache, but it's
not heavily fragmented by slab or other unmovable objects, compaction
will succeed every single time after reclaim frees 2M of cache like
it's asked to do.

reclaim succeeds every time, compaction then succeeds every time.

Not doing reclaim after COMPACT_SKIPPED is returned simply makes
compaction unable to compact memory once all nodes are filled by
filesystem cache.

Certainly it's better not to invoke reclaim at all if __GFP_THISNODE
is set, than swapping out heavy over the local node. Doing so however
has the drawback of reducing the direct compaction effectiveness. I
don't think it's true that reclaim is generally "pointless", it's just
that invoking any reclaim backfired so bad if __GFP_THISNODE was set,
than anything else (including weakining compaction effectiveness) was
better.

> If memory compaction were patched such that it can report that it could 
> successfully free a page of the specified order if there were free pages 
> at the end of the zone it could migrate to, reclaim might be helpful.  But 
> with the current implementation, I don't think that is reliably possible.  
> These free pages could easily be skipped over by the migration scanner 
> because of the presence of slab pages, for example, and unavailable to the 
> freeing scanner.

Yes there's one case where reclaim is "pointless", but it happens once
and then COMPACT_DEFERRED is returned and __GFP_NORETRY will skip
reclaim then.

So you're right when we hit fragmentation there's one and only one
"pointless" reclaim invocation. And immediately after we also
exponentially backoff on the compaction invocations with the
compaction deferred logic.

We could try optimize away such "pointless" reclaim event for sure,
but it's probably an optimization that may just get lost in the noise
and may not be measurable, because it only happens once when the first
full fragmentation is encountered.

> I really think that for this patch to be merged over my proposed change 
> that it needs to be clearly demonstrated that reclaim was successful in 
> that it freed memory that was subsequently available to the compaction 
> freeing scanner and that enabled entire pageblocks to become free.  That, 
> in my experience, will very very seldom be successful because of internal 
> slab fragmentation, compaction_alloc() cannot soak up pages from the 
> reclaimed memory, and potentially thrash the zone completely pointlessly.  
> The last point is the problem being reported here, but the other two are 
> as legitimate.

I think the demonstration can already be inferred, because if hit full
memory fragmentation after every reclaim, __GFP_NORETRY would have
solved the "pathological THP allocation behavior" without requiring
your change to __GFP_NORETRY that makes it behave like
__GFP_COMPACT_ONLY for order == HPAGE_PMD_ORDER.

Anyway you can add a few statistic counters and verify in more
accurate way how often a COMPACT_SKIPPED + reclaim cycle is followed
by a COMPACT_DEFERRED.

> I'd appreciate if Andrea can test this patch, have a rebuttal that we 
> should still remove __GFP_THISNODE because we don't care about locality as 
> much as forming a hugepage, we can make that change, and then merge this 
> instead of causing such massive fault and access latencies.

I can certainly test, but from source review I'm already convinced
it'll solve fine the "pathological THP allocation behavior", no
argument about that. It's certainly better and more correct your patch
than the current upstream (no security issues with lack of permissions
for __GFP_THISNODE anymore either).

I expect your patch will run 100% equivalent to __GFP_COMPACT_ONLY
alternative I posted, for our testcase that hit into the "pathological
THP allocation behavior".

Your patch encodes __GFP_COMPACT_ONLY into the __GFP_NORETRY semantics
and hardcodes the __GFP_COMPACT_ONLY for all orders = HPAGE_PMD_SIZE
no matter which is the caller.

As opposed I let the caller choose and left __GFP_NORETRY semantics
alone and orthogonal to the __GFP_COMPACT_ONLY semantics. I think
letting the caller decide instead of hardcoding it for order 9 is
better, because __GFP_COMPACT_ONLY made sense to be set only if
__GFP_THISNODE was also set by the caller.

If a driver does an order 9 allocation with __GFP_THISNODE not set,
your patch will prevent it to allocate remote THP if all remote nodes
are full of cache (which is a reasonable common assumption as more THP
are allocated over time eating in all free memory). My patch didn't
alter that so I tend to prefer the __GFP_COMPACT_ONLY than the
hardcoding __GFP_COMPACT_ONLY for all order 9 allocations regardless
if __GFP_THISNODE is set or not.

Last but not the last, from another point of view, I thought calling
remote compaction was a feature especially for MADV_HUGEPAGE. However
if avoiding the 40% increase latency for MADV_HUGEPAGE was the primary
motivation for __GFP_THISNODE, not sure how we can say that such an
high allocation latency is only a concern for order 9 allocations and
all other costly_order allocations (potentially with orders even
higher than HPAGE_PMD_ORDER in fact) are ok to keep calling remote
compaction and incur in the 40% higher latency.

Thanks,
Andrea
Andrea Arcangeli Oct. 9, 2018, 11:03 p.m. UTC | #20
On Tue, Oct 09, 2018 at 04:25:10PM +0200, Michal Hocko wrote:
> On Tue 09-10-18 14:00:34, Mel Gorman wrote:
> > On Tue, Oct 09, 2018 at 02:27:45PM +0200, Michal Hocko wrote:
> > > [Sorry for being slow in responding but I was mostly offline last few
> > >  days]
> > > 
> > > On Tue 09-10-18 10:48:25, Mel Gorman wrote:
> > > [...]
> > > > This goes back to my point that the MADV_HUGEPAGE hint should not make
> > > > promises about locality and that introducing MADV_LOCAL for specialised
> > > > libraries may be more appropriate with the initial semantic being how it
> > > > treats MADV_HUGEPAGE regions.
> > > 
> > > I agree with your other points and not going to repeat them. I am not
> > > sure madvise s the best API for the purpose though. We are talking about
> > > memory policy here and there is an existing api for that so I would
> > > _prefer_ to reuse it for this purpose.
> > > 
> > 
> > I flip-flopped on that one in my head multiple times on the basis of
> > how strict it should be. Memory policies tend to be black or white --
> > bind here, interleave there, etc. It wasn't clear to me what the best
> > policy would be to describe "allocate local as best as you can but allow
> > fallbacks if necessary".

MPOL_PREFERRED is not black and white. In fact I asked David earlier
if MPOL_PREFERRED could check if it would already be a good fit for
this. Still the point is it requires privilege (and for a good
reason).

> I was thinking about MPOL_NODE_PROXIMITY with the following semantic:
> - try hard to allocate from a local or very close numa node(s) even when
> that requires expensive operations like the memory reclaim/compaction
> before falling back to other more distant numa nodes.

If MPOL_PREFERRED can't work something like this could be added.

I think "madvise vs mbind" is more an issue of "no-permission vs
permission" required. And if the processes ends up swapping out all
other process with their memory already allocated in the node, I think
some permission is correct to be required, in which case an mbind
looks a better fit. MPOL_PREFERRED also looks a first candidate for
investigation as it's already not black and white and allows spillover
and may already do the right thing in fact if set on top of
MADV_HUGEPAGE.

Thanks,
Andrea
Vlastimil Babka Oct. 10, 2018, 7:54 a.m. UTC | #21
On 10/10/18 12:51 AM, Andrea Arcangeli wrote:
> Yes there's one case where reclaim is "pointless", but it happens once
> and then COMPACT_DEFERRED is returned and __GFP_NORETRY will skip
> reclaim then.
> 
> So you're right when we hit fragmentation there's one and only one
> "pointless" reclaim invocation. And immediately after we also
> exponentially backoff on the compaction invocations with the
> compaction deferred logic.
> 
> We could try optimize away such "pointless" reclaim event for sure,
> but it's probably an optimization that may just get lost in the noise
> and may not be measurable, because it only happens once when the first
> full fragmentation is encountered.

Note there's a small catch in the above. defer_compaction() has always
only been called after a failure on higher priority than
COMPACT_PRIO_ASYNC, where it's assumed that async compaction can
terminate prematurely due to a number of reasons, so it doesn't mean
that the zone itself cannot be compacted.
And, for __GFP_NORETRY, if the initial compaction fails, we keep using
async compaction also for the second, after-reclaim attempt (which would
otherwise use SYNC_LIGHT):

        /*
         * Looks like reclaim/compaction is worth trying, but
         * sync compaction could be very expensive, so keep
         * using async compaction.
         */
        compact_priority = INIT_COMPACT_PRIORITY;

This doesn't affect current madvised THP allocation which doesn't use
__GFP_NORETRY, but could explain why you saw no benefit from changing it
to __GFP_NORETRY.
David Rientjes Oct. 10, 2018, 9 p.m. UTC | #22
On Tue, 9 Oct 2018, Andrea Arcangeli wrote:

> On Tue, Oct 09, 2018 at 03:17:30PM -0700, David Rientjes wrote:
> > causes workloads to severely regress both in fault and access latency when 
> > we know that direct reclaim is unlikely to make direct compaction free an 
> > entire pageblock.  It's more likely than not that the reclaim was 
> > pointless and the allocation will still fail.
> 
> How do you know that? If all RAM is full of filesystem cache, but it's
> not heavily fragmented by slab or other unmovable objects, compaction
> will succeed every single time after reclaim frees 2M of cache like
> it's asked to do.
> 
> reclaim succeeds every time, compaction then succeeds every time.
> 
> Not doing reclaim after COMPACT_SKIPPED is returned simply makes
> compaction unable to compact memory once all nodes are filled by
> filesystem cache.
> 

For reclaim to assist memory compaction based on compaction's current 
implementation, it would require that the freeing scanner starting at the 
end of the zone can find these reclaimed pages as migration targets and 
that compaction will be able to utilize these migration targets to make an 
entire pageblock free.

In such low on memory conditions when a node is fully saturated, it is 
much less likely that memory compaction can free an entire pageblock even 
if the freeing scanner can find these now-free pages.  More likely is that 
we have unmovable pages in MIGRATE_MOVABLE pageblocks because the 
allocator allows fallback to pageblocks of other migratetypes to return 
node local memory (because affinity matters for kernel memory as it 
matters for thp) rather than fallback to remote memory.

This has caused us significant pain where we have 1.5GB of slab, for 
example, spread over 100GB of pageblocks once the node has become 
saturated.

So reclaim is not always "pointless" as you point out, but it should at 
least only be attempted if memory compaction could free an entire 
pageblock if it had free memory to migrate to.  It's much harder to make 
sure that these freed pages can be utilized by the freeing scanner.  Based 
on how memory compaction is implemented, I do not think any guarantee can 
be made that reclaim will ever be successful in allowing it to make 
order-9 memory available, unfortunately.

> > If memory compaction were patched such that it can report that it could 
> > successfully free a page of the specified order if there were free pages 
> > at the end of the zone it could migrate to, reclaim might be helpful.  But 
> > with the current implementation, I don't think that is reliably possible.  
> > These free pages could easily be skipped over by the migration scanner 
> > because of the presence of slab pages, for example, and unavailable to the 
> > freeing scanner.
> 
> Yes there's one case where reclaim is "pointless", but it happens once
> and then COMPACT_DEFERRED is returned and __GFP_NORETRY will skip
> reclaim then.
> 
> So you're right when we hit fragmentation there's one and only one
> "pointless" reclaim invocation. And immediately after we also
> exponentially backoff on the compaction invocations with the
> compaction deferred logic.
> 

This assumes that every time we get COMPACT_SKIPPED that if we can simply 
free memory that it'll succeed and that's definitely not the case based on 
the implementation of memory compaction: compaction_alloc() needs to find 
the memory and the migration scanner needs to free an entire pageblock.  
The migration scanner doesn't even look ahead to see if that's possible 
before starting to migrate pages, it's limited to COMPACT_CLUSTER_MAX.

The scenario we have: compaction returns COMPACT_SKIPPED; reclaim 
expensively tries to reclaim memory by thrashing the local node; the 
compaction migration scanner has already passed over the now-freed pages 
so it's inaccessible; if accessible, the migration scanner migrates memory 
to the newly freed pages but fails to make a pageblock free; loop.

My contention is that the second step is only justified if we can 
guarantee the freed memory can be useful for compaction and that it can 
free an entire pageblock for the hugepage if it can migrate.  Both of 
those are not possible to determine based on the current implementation.

> > I'd appreciate if Andrea can test this patch, have a rebuttal that we 
> > should still remove __GFP_THISNODE because we don't care about locality as 
> > much as forming a hugepage, we can make that change, and then merge this 
> > instead of causing such massive fault and access latencies.
> 
> I can certainly test, but from source review I'm already convinced
> it'll solve fine the "pathological THP allocation behavior", no
> argument about that. It's certainly better and more correct your patch
> than the current upstream (no security issues with lack of permissions
> for __GFP_THISNODE anymore either).
> 
> I expect your patch will run 100% equivalent to __GFP_COMPACT_ONLY
> alternative I posted, for our testcase that hit into the "pathological
> THP allocation behavior".
> 
> Your patch encodes __GFP_COMPACT_ONLY into the __GFP_NORETRY semantics
> and hardcodes the __GFP_COMPACT_ONLY for all orders = HPAGE_PMD_SIZE
> no matter which is the caller.
> 
> As opposed I let the caller choose and left __GFP_NORETRY semantics
> alone and orthogonal to the __GFP_COMPACT_ONLY semantics. I think
> letting the caller decide instead of hardcoding it for order 9 is
> better, because __GFP_COMPACT_ONLY made sense to be set only if
> __GFP_THISNODE was also set by the caller.
> 

I've hardcoded it directly for pageblock_order because compaction works 
over pageblocks and we lack the two crucial points of information I've 
stated that determines whether direct reclaim could possibly be useful.  
(It's more correctly implemented as order >= pageblock_order as opposed 
to order == pageblock_order.)

> If a driver does an order 9 allocation with __GFP_THISNODE not set,
> your patch will prevent it to allocate remote THP if all remote nodes
> are full of cache (which is a reasonable common assumption as more THP
> are allocated over time eating in all free memory).

Iff it's using __GFP_NORETRY, yes, the allocation and remote access 
latency that is incurred is the same as for thp.  Hugetlbfs doesn't use 
__GFP_NORETRY, it uses __GFP_RETRY_MAYFAIL, so it will attempt to reclaim 
memory after my patch.

My patch is assuming that a single call to reclaim followed up by another 
attempt at compaction is not beneficial for pageblock_order sized 
allocations with __GFP_NORETRY because it's unlikely to either free an 
entire pageblock and compaction may not be able to access this memory.  
It's based on how memory compaction is implemented rather than any special 
heuristic.

I won't argue if you protect this logic with __GFP_COMPACT_ONLY, but I 
think thp allocations should always have __GFP_NORETRY based on 
compaction's implementation.

I see removing __GFP_THISNODE as a separate discussion: if, after my patch 
(perhaps with a modification for __GFP_COMPACT_ONLY on top of it), you 
still get unacceptable fault latency and can show that remote access 
latency to remotely allocated hugepage is better on some platform that 
isn't Haswell, Naples, and Rome, we can address that but it will probably 
require more work that simply unsetting __GFP_THISNODE because it will 
depend on the latency to certain remote nodes over others.
David Rientjes Oct. 10, 2018, 9:19 p.m. UTC | #23
On Tue, 9 Oct 2018, Andrea Arcangeli wrote:

> I think "madvise vs mbind" is more an issue of "no-permission vs
> permission" required. And if the processes ends up swapping out all
> other process with their memory already allocated in the node, I think
> some permission is correct to be required, in which case an mbind
> looks a better fit. MPOL_PREFERRED also looks a first candidate for
> investigation as it's already not black and white and allows spillover
> and may already do the right thing in fact if set on top of
> MADV_HUGEPAGE.
> 

We would never want to thrash the local node for hugepages because there 
is no guarantee that any swapping is useful.  On COMPACT_SKIPPED due to 
low memory, we have very clear evidence that pageblocks are already 
sufficiently fragmented by unmovable pages such that compaction itself, 
even with abundant free memory, fails to free an entire pageblock due to 
the allocator's preference to fragment pageblocks of fallback migratetypes 
over returning remote free memory.

As I've stated, we do not want to reclaim pointlessly when compaction is 
unable to access the freed memory or there is no guarantee it can free an 
entire pageblock.  Doing so allows thrashing of the local node, or remote 
nodes if __GFP_THISNODE is removed, and the hugepage still cannot be 
allocated.  If this proposed mbind() that requires permissions is geared 
to me as the user, I'm afraid the details of what leads to the thrashing 
are not well understood because I certainly would never use this.
David Rientjes Oct. 15, 2018, 10:30 p.m. UTC | #24
On Wed, 10 Oct 2018, David Rientjes wrote:

> > I think "madvise vs mbind" is more an issue of "no-permission vs
> > permission" required. And if the processes ends up swapping out all
> > other process with their memory already allocated in the node, I think
> > some permission is correct to be required, in which case an mbind
> > looks a better fit. MPOL_PREFERRED also looks a first candidate for
> > investigation as it's already not black and white and allows spillover
> > and may already do the right thing in fact if set on top of
> > MADV_HUGEPAGE.
> > 
> 
> We would never want to thrash the local node for hugepages because there 
> is no guarantee that any swapping is useful.  On COMPACT_SKIPPED due to 
> low memory, we have very clear evidence that pageblocks are already 
> sufficiently fragmented by unmovable pages such that compaction itself, 
> even with abundant free memory, fails to free an entire pageblock due to 
> the allocator's preference to fragment pageblocks of fallback migratetypes 
> over returning remote free memory.
> 
> As I've stated, we do not want to reclaim pointlessly when compaction is 
> unable to access the freed memory or there is no guarantee it can free an 
> entire pageblock.  Doing so allows thrashing of the local node, or remote 
> nodes if __GFP_THISNODE is removed, and the hugepage still cannot be 
> allocated.  If this proposed mbind() that requires permissions is geared 
> to me as the user, I'm afraid the details of what leads to the thrashing 
> are not well understood because I certainly would never use this.
> 

At the risk of beating a dead horse that has already been beaten, what are 
the plans for this patch when the merge window opens?  It would be rather 
unfortunate for us to start incurring a 14% increase in access latency and 
40% increase in fault latency.  Would it be possible to test with my 
patch[*] that does not try reclaim to address the thrashing issue?  If 
that is satisfactory, I don't have a strong preference if it is done with 
a hardcoded pageblock_order and __GFP_NORETRY check or a new 
__GFP_COMPACT_ONLY flag.

I think the second issue of faulting remote thp by removing __GFP_THISNODE 
needs supporting evidence that shows some platforms benefit from this (and 
not with numa=fake on the command line :).

 [*] https://marc.info/?l=linux-kernel&m=153903127717471
Andrew Morton Oct. 15, 2018, 10:44 p.m. UTC | #25
On Mon, 15 Oct 2018 15:30:17 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> At the risk of beating a dead horse that has already been beaten, what are 
> the plans for this patch when the merge window opens?

I'll hold onto it until we've settled on something.  Worst case,
Andrea's original is easily backportable.

>  It would be rather 
> unfortunate for us to start incurring a 14% increase in access latency and 
> 40% increase in fault latency.

Yes.

>  Would it be possible to test with my 
> patch[*] that does not try reclaim to address the thrashing issue?

Yes please.

And have you been able to test it with the sort of workloads which
Andrea is attempting to address?
Andrea Arcangeli Oct. 15, 2018, 10:57 p.m. UTC | #26
On Mon, Oct 15, 2018 at 03:30:17PM -0700, David Rientjes wrote:
> At the risk of beating a dead horse that has already been beaten, what are 
> the plans for this patch when the merge window opens?  It would be rather 
> unfortunate for us to start incurring a 14% increase in access latency and 
> 40% increase in fault latency.  Would it be possible to test with my 
> patch[*] that does not try reclaim to address the thrashing issue?  If 
> that is satisfactory, I don't have a strong preference if it is done with 
> a hardcoded pageblock_order and __GFP_NORETRY check or a new 
> __GFP_COMPACT_ONLY flag.

I don't like the pageblock size hardcoding inside the page
allocator. __GFP_COMPACT_ONLY is fully runtime equivalent, but it at
least let the caller choose the behavior, so it looks more flexible.

To fix your 40% fault latency concern in the short term we could use
__GFP_COMPACT_ONLY, but I think we should get rid of
__GFP_COMPACT_ONLY later: we need more statistical data in the zone
structure to track remote compaction failures triggering because the
zone is fully fragmented.

Once the zone is fully fragmented we need to do a special exponential
backoff on that zone when the zone is from a remote node.

Furthermore at the first remote NUMA node zone failure caused by full
fragmentation we need to interrupt compaction and stop trying with all
remote nodes.

As long as compaction returns COMPACT_SKIPPED it's ok to keep doing
reclaim and keep doing compaction, as long as compaction succeeds.

What is causing the higher latency is the fact we try all zones from
all remote nodes even if there's a failure caused by full
fragmentation of all remote zone, and we're unable to skip (skip with
exponential backoff) only those zones where compaction cannot succeed
because of fragmentation.

Once we achieve the above deleting __GFP_COMPACT_ONLY will be a
trivial patch.

> I think the second issue of faulting remote thp by removing __GFP_THISNODE 
> needs supporting evidence that shows some platforms benefit from this (and 
> not with numa=fake on the command line :).
> 
>  [*] https://marc.info/?l=linux-kernel&m=153903127717471

That is needed to compare the current one liner fix with
__GFP_COMPACT_ONLY, but I don't think it's needed to compare v4.18
with the current fix. The badness of v4.18 was too bad keep, getting
local PAGE_SIZEd memory or remote THPs is a secondary issue.

In fact the main reason for __GFP_COMPACT_ONLY is not anymore such
tradeoff, but not to spend too much CPU in compaction when all nodes
are fragmented to avoid increasing the allocation latency too much.

Thanks,
Andrea
Andrea Arcangeli Oct. 15, 2018, 11:19 p.m. UTC | #27
Hello Andrew,

On Mon, Oct 15, 2018 at 03:44:59PM -0700, Andrew Morton wrote:
> On Mon, 15 Oct 2018 15:30:17 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:
> >  Would it be possible to test with my 
> > patch[*] that does not try reclaim to address the thrashing issue?
> 
> Yes please.

It'd also be great if a testcase reproducing the 40% higher access
latency (with the one liner original fix) was available.

We don't have a testcase for David's 40% latency increase problem, but
that's likely to only happen when the system is somewhat low on memory
globally. So the measurement must be done when compaction starts
failing globally on all zones, but before the system starts
swapping. The more global fragmentation the larger will be the window
between "compaction fails because all zones are too fragmented" and
"there is still free PAGE_SIZEd memory available to reclaim without
swapping it out". If I understood correctly, that is precisely the
window where the 40% higher latency should materialize.

The workload that shows the badness in the upstream code is fairly
trivial. Mel and Zi reproduced it too and I have two testcases that
can reproduce it, one with device assignment and the other is just
memhog. That's a massively larger window than the one where the 40%
higher latency materializes.

When there's 75% or more of the RAM free (not even allocated as easily
reclaimable pagecache) globally, you don't expect to hit heavy
swapping.

The 40% THP allocation latency increase if you use MADV_HUGEPAGE in
such window where all remote zones are fully fragmented is somehow
lesser of a concern in my view (plus there's the compact deferred
logic that should mitigate that scenario). Furthermore it is only a
concern for page faults in MADV_HUGEPAGE ranges. If MADV_HUGEPAGE is
set the userland allocation is long lived, so such higher allocation
latency won't risk to hit short lived allocations that don't set
MADV_HUGEPAGE (unless madvise=always, but that's not the default
precisely because not all allocations are long lived).

If the MADV_HUGEPAGE using library was freely available it'd also be
nice.
Mel Gorman Oct. 16, 2018, 7:46 a.m. UTC | #28
On Mon, Oct 15, 2018 at 03:44:59PM -0700, Andrew Morton wrote:
> On Mon, 15 Oct 2018 15:30:17 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:
> 
> > At the risk of beating a dead horse that has already been beaten, what are 
> > the plans for this patch when the merge window opens?
> 
> I'll hold onto it until we've settled on something.  Worst case,
> Andrea's original is easily backportable.
> 

I consider this to be an unfortunate outcome. On the one hand, we have a
problem that three people can trivially reproduce with known test cases
and a patch shown to resolve the problem. Two of those three people work
on distributions that are exposed to a large number of users. On the
other, we have a problem that requires the system to be in a specific
state and an unknown workload that suffers badly from the remote access
penalties with a patch that has review concerns and has not been proven
to resolve the trivial cases. In the case of distributions, the first
patch addresses concerns with a common workload where on the other hand
we have an internal workload of a single company that is affected --
which indirectly affects many users admittedly but only one entity directly.

At the absolute minimum, a test case for the "system fragmentation incurs
access penalties for a workload" scenario that could both replicate the
fragmentation and demonstrate the problem should have been available before
the patch was rejected.  With the test case, there would be a chance that
others could analyse the problem and prototype some fixes. The test case
was requested in the thread and never produced so even if someone were to
prototype fixes, it would be dependant on a third party to test and produce
data which is a time-consuming loop. Instead, we are more or less in limbo.
Andrew Morton Oct. 16, 2018, 10:37 p.m. UTC | #29
On Tue, 16 Oct 2018 08:46:06 +0100 Mel Gorman <mgorman@suse.de> wrote:

> On Mon, Oct 15, 2018 at 03:44:59PM -0700, Andrew Morton wrote:
> > On Mon, 15 Oct 2018 15:30:17 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:
> > 
> > > At the risk of beating a dead horse that has already been beaten, what are 
> > > the plans for this patch when the merge window opens?
> > 
> > I'll hold onto it until we've settled on something.  Worst case,
> > Andrea's original is easily backportable.
> > 
> 
> I consider this to be an unfortunate outcome. On the one hand, we have a
> problem that three people can trivially reproduce with known test cases
> and a patch shown to resolve the problem. Two of those three people work
> on distributions that are exposed to a large number of users. On the
> other, we have a problem that requires the system to be in a specific
> state and an unknown workload that suffers badly from the remote access
> penalties with a patch that has review concerns and has not been proven
> to resolve the trivial cases. In the case of distributions, the first
> patch addresses concerns with a common workload where on the other hand
> we have an internal workload of a single company that is affected --
> which indirectly affects many users admittedly but only one entity directly.
> 
> At the absolute minimum, a test case for the "system fragmentation incurs
> access penalties for a workload" scenario that could both replicate the
> fragmentation and demonstrate the problem should have been available before
> the patch was rejected.  With the test case, there would be a chance that
> others could analyse the problem and prototype some fixes. The test case
> was requested in the thread and never produced so even if someone were to
> prototype fixes, it would be dependant on a third party to test and produce
> data which is a time-consuming loop. Instead, we are more or less in limbo.
> 

OK, thanks.

But we're OK holding off for a few weeks, yes?  If we do that
we'll still make it into 4.19.1.  Am reluctant to merge this while
discussion, testing and possibly more development are ongoing.
Andrea Arcangeli Oct. 16, 2018, 11:11 p.m. UTC | #30
Hello,

On Tue, Oct 16, 2018 at 03:37:15PM -0700, Andrew Morton wrote:
> we'll still make it into 4.19.1.  Am reluctant to merge this while
> discussion, testing and possibly more development are ongoing.

I think there can be definitely more developments primarily to make
the compact deferred logic NUMA aware. Instead of a global deferred
logic, we should split it per zone per node so that it backs off
exponentially with an higher cap in remote nodes. The current global
"backoff" limit will still apply to the "local" zone compaction. Who
would like to work on that?

However I don't think it's worth waiting for that, because it's not a
trivial change.

Certainly we can't ship upstream in production with this bug, so if it
doesn't get fixed upstream we'll fix it downstream first until the
more developments are production ready. This was a severe regression
compared to previous kernels that made important workloads unusable
and it starts when __GFP_THISNODE was added to THP allocations under
MADV_HUGEPAGE. It is not a significant risk to go to the previous
behavior before __GFP_THISNODE was added, it worked like that for
years.

This was simply an optimization to some lucky workloads that can fit
in a single node, but it ended up breaking the VM for others that
can't possibly fit in a single node, so going back is safe.

Thanks,
Andrea
Andrew Morton Oct. 16, 2018, 11:16 p.m. UTC | #31
On Tue, 16 Oct 2018 19:11:49 -0400 Andrea Arcangeli <aarcange@redhat.com> wrote:

> This was a severe regression
> compared to previous kernels that made important workloads unusable
> and it starts when __GFP_THISNODE was added to THP allocations under
> MADV_HUGEPAGE. It is not a significant risk to go to the previous
> behavior before __GFP_THISNODE was added, it worked like that for
> years.

5265047ac301 ("mm, thp: really limit transparent hugepage allocation to
local node") was April 2015.  That's a long time for a "severe
regression" to go unnoticed?
Michal Hocko Oct. 17, 2018, 7:08 a.m. UTC | #32
On Tue 16-10-18 16:16:43, Andrew Morton wrote:
> On Tue, 16 Oct 2018 19:11:49 -0400 Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> > This was a severe regression
> > compared to previous kernels that made important workloads unusable
> > and it starts when __GFP_THISNODE was added to THP allocations under
> > MADV_HUGEPAGE. It is not a significant risk to go to the previous
> > behavior before __GFP_THISNODE was added, it worked like that for
> > years.@s1@s2@s1
> 
> 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to
> local node") was April 2015.  That's a long time for a "severe
> regression" to go unnoticed?

Well, it gets some time to adopt changes in enterprise and we start
seeing people reporting this issue. That is why I believe we should
start with something really simple and stable tree backportable first
and then build something more complex on top.
Mel Gorman Oct. 17, 2018, 9 a.m. UTC | #33
On Tue, Oct 16, 2018 at 03:37:15PM -0700, Andrew Morton wrote:
> On Tue, 16 Oct 2018 08:46:06 +0100 Mel Gorman <mgorman@suse.de> wrote:
> > I consider this to be an unfortunate outcome. On the one hand, we have a
> > problem that three people can trivially reproduce with known test cases
> > and a patch shown to resolve the problem. Two of those three people work
> > on distributions that are exposed to a large number of users. On the
> > other, we have a problem that requires the system to be in a specific
> > state and an unknown workload that suffers badly from the remote access
> > penalties with a patch that has review concerns and has not been proven
> > to resolve the trivial cases. In the case of distributions, the first
> > patch addresses concerns with a common workload where on the other hand
> > we have an internal workload of a single company that is affected --
> > which indirectly affects many users admittedly but only one entity directly.
> > 
> > At the absolute minimum, a test case for the "system fragmentation incurs
> > access penalties for a workload" scenario that could both replicate the
> > fragmentation and demonstrate the problem should have been available before
> > the patch was rejected.  With the test case, there would be a chance that
> > others could analyse the problem and prototype some fixes. The test case
> > was requested in the thread and never produced so even if someone were to
> > prototype fixes, it would be dependant on a third party to test and produce
> > data which is a time-consuming loop. Instead, we are more or less in limbo.
> > 
> 
> OK, thanks.
> 
> But we're OK holding off for a few weeks, yes?  If we do that
> we'll still make it into 4.19.1.  Am reluctant to merge this while
> discussion, testing and possibly more development are ongoing.
> 

Without a test case that reproduces the Google case, we are a bit stuck.
Previous experience indicates that just fragmenting memory is not enough
to give a reliable case as unless the unmovable/reclaimable pages are
"sticky", the normal reclaim can handle it. Similarly, the access
pattern of the target workload is important as it would need to be
something larger than L3 cache to constantly hit the access penalty. We
do not know what the exact characteristics of the Google workload are
but we know that a fix for three cases is not equivalent for the Google
case.

The discussion has circled around wish-list items such as better
fragmentation control, node-aware compaction, improved compact deferred
logic and lower latencies with little in the way of actual specifics
of implementation or patches. Improving fragmentation control would
benefit from a workload that actually fragments so the extfrag events
can be monitored as well as maybe a dump of pageblocks with mixed pages.

On node-aware compaction, that was not implemented initially simply
because HighMem was common and that needs to be treated as a corner case
-- we cannot safely migrate pages from zone normal to highmem. That one
is relatively trivial to measure as it's a functional issue.

However, backing off compaction properly to maximise allocation success
rates while minimising allocation latency and access latency needs a live
workload that is representative. Trivial cases like the java workloads,
nas or usemem won't do as they either exhibit special locality or are
streaming readers/writers. Memcache might work but the driver in that
case is critical to ensure the access penalties are incurred. Again,
a modern example is missing.

As for why this took so long to discover, it is highly likely that it's
due to VM's being sized such as they typically fit in a NUMA node so
it would have avoided the worst case scenarios. Furthermore, a machine
dedicated to VM's has fewer concerns with respect to slab allocations
and unmovable allocations fragmenting memory long-term. Finally, the
worst case scenarios are encountered when there is a mix of different
workloads of variable duration which may be common in a Google-like setup
with different jobs being dispatched across a large network but less so
in other setups where a service tends to be persistent. We already know
that some of the worst performance problems take years to discover.
David Rientjes Oct. 22, 2018, 8:45 p.m. UTC | #34
On Mon, 15 Oct 2018, Andrea Arcangeli wrote:

> > At the risk of beating a dead horse that has already been beaten, what are 
> > the plans for this patch when the merge window opens?  It would be rather 
> > unfortunate for us to start incurring a 14% increase in access latency and 
> > 40% increase in fault latency.  Would it be possible to test with my 
> > patch[*] that does not try reclaim to address the thrashing issue?  If 
> > that is satisfactory, I don't have a strong preference if it is done with 
> > a hardcoded pageblock_order and __GFP_NORETRY check or a new 
> > __GFP_COMPACT_ONLY flag.
> 
> I don't like the pageblock size hardcoding inside the page
> allocator. __GFP_COMPACT_ONLY is fully runtime equivalent, but it at
> least let the caller choose the behavior, so it looks more flexible.
> 

I'm not sure that I understand why the user would ever want to thrash 
their zone(s) for allocations of this order.  The problem here is 
specifically related to an entire pageblock becoming freeable and the 
unlikeliness that reclaiming/swapping/thrashing will assist memory 
compaction in making that happen.  For this reason, I think the 
order >= pageblock_order check is reasonable because it depends on the 
implementation of memory compaction.

Why do we need another gfp flag for thp allocations when they are made to 
be __GFP_NORETRY by default and it is very unlikely that reclaiming once 
and then retrying compaction is going to make an entire pageblock free?

I'd like to know (1) how continuous reclaim activity can make entire 
pageblocks freeable without thrashing and (2) the metrics that we can use 
to determine when it is worthwhile vs harmful.  I don't believe (1) is 
ever helpful based on the implementation of memory compaction and we lack 
(2) since reclaim is not targeted to memory that compaction can use.

> As long as compaction returns COMPACT_SKIPPED it's ok to keep doing
> reclaim and keep doing compaction, as long as compaction succeeds.
> 

Compaction will operate on 32 pages at a time and declare success each 
time and then pick up where it left off the next time it is called in the 
hope that it "succeeds" 512/32=16 times in a row while constantly 
reclaiming memory.  Even a single slab page in that pageblock will make 
all of this work useless.  Reclaimed memory not accessible by the freeing 
scanner will make its work useless.  We lack the ability to determine when 
compaction is successful in freeing a full pageblock.
David Rientjes Oct. 22, 2018, 8:54 p.m. UTC | #35
On Mon, 15 Oct 2018, Andrea Arcangeli wrote:

> > On Mon, 15 Oct 2018 15:30:17 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:
> > >  Would it be possible to test with my 
> > > patch[*] that does not try reclaim to address the thrashing issue?
> > 
> > Yes please.
> 
> It'd also be great if a testcase reproducing the 40% higher access
> latency (with the one liner original fix) was available.
> 

I never said 40% higher access latency, I said 40% higher fault latency.  

The higher access latency is 13.9% as measured on Haswell.

The test case is rather trivial: fragment all memory with order-4 memory 
to replicate a fragmented local zone, use sched_setaffinity() to bind to 
that node, and fault a reasonable number of hugepages (128MB, 256, 
whatever).  The cost of faulting remotely in this case was measured to be 
40% higher than falling back to local small pages.  This occurs quite 
obviously because you are thrashing the remote node trying to allocate 
thp.

> We don't have a testcase for David's 40% latency increase problem, but
> that's likely to only happen when the system is somewhat low on memory
> globally.

Well, yes, but that's most of our systems.  We can't keep around gigabytes 
of memory free just to work around this patch.  Removing __GFP_THISNODE to 
avoid thrashing the local node obviously will incur a substantial 
performance degradation if you thrash the remote node as well.  This 
should be rather straight forward.

> When there's 75% or more of the RAM free (not even allocated as easily
> reclaimable pagecache) globally, you don't expect to hit heavy
> swapping.
> 

I agree there is no regression introduced by your patch when 75% of memory 
is free.

> The 40% THP allocation latency increase if you use MADV_HUGEPAGE in
> such window where all remote zones are fully fragmented is somehow
> lesser of a concern in my view (plus there's the compact deferred
> logic that should mitigate that scenario). Furthermore it is only a
> concern for page faults in MADV_HUGEPAGE ranges. If MADV_HUGEPAGE is
> set the userland allocation is long lived, so such higher allocation
> latency won't risk to hit short lived allocations that don't set
> MADV_HUGEPAGE (unless madvise=always, but that's not the default
> precisely because not all allocations are long lived).
> 
> If the MADV_HUGEPAGE using library was freely available it'd also be
> nice.
> 

You scan your mappings for .text segments, map a hugepage-aligned region 
sufficient in size, mremap() to that region, and do MADV_HUGEPAGE.
David Rientjes Oct. 22, 2018, 9:04 p.m. UTC | #36
On Tue, 16 Oct 2018, Mel Gorman wrote:

> I consider this to be an unfortunate outcome. On the one hand, we have a
> problem that three people can trivially reproduce with known test cases
> and a patch shown to resolve the problem. Two of those three people work
> on distributions that are exposed to a large number of users. On the
> other, we have a problem that requires the system to be in a specific
> state and an unknown workload that suffers badly from the remote access
> penalties with a patch that has review concerns and has not been proven
> to resolve the trivial cases.

The specific state is that remote memory is fragmented as well, this is 
not atypical.  Removing __GFP_THISNODE to avoid thrashing a zone will only 
be beneficial when you can allocate remotely instead.  When you cannot 
allocate remotely instead, you've made the problem much worse for 
something that should be __GFP_NORETRY in the first place (and was for 
years) and should never thrash.

I'm not interested in patches that require remote nodes to have an 
abundance of free or unfragmented memory to avoid regressing.

> In the case of distributions, the first
> patch addresses concerns with a common workload where on the other hand
> we have an internal workload of a single company that is affected --
> which indirectly affects many users admittedly but only one entity directly.
> 

The alternative, which is my patch, hasn't been tested or shown why it 
cannot work.  We continue to talk about order >= pageblock_order vs
__GFP_COMPACTONLY.

I'd like to know, specifically:

 - what measurable affect my patch has that is better solved with removing
   __GFP_THISNODE on systems where remote memory is also fragmented?

 - what platforms benefit from remote access to hugepages vs accessing
   local small pages (I've asked this maybe 4 or 5 times now)?

 - how is reclaiming (and possibly thrashing) memory helpful if compaction
   fails to free an entire pageblock due to slab fragmentation due to low
   on memory conditions and the page allocator preference to return node-
   local memory?

 - how is reclaiming (and possibly thrashing) memory helpful if compaction
   cannot access the memory reclaimed because the freeing scanner has 
   already passed by it, or the migration scanner has passed by it, since
   this reclaim is not targeted to pages it can find?

 - what metrics can be introduced to the page allocator so that we can
   determine that reclaiming (and possibly thrashing) memory will result 
   in a hugepage being allocated?

Until we have answers, especially for the last, there is no reason why thp 
allocations should not be __GFP_NORETRY including for MADV_HUGEPAGE 
regions.  The implementation of memory compaction simply cannot guarantee 
that the cost is worthwhile.
Zi Yan Oct. 23, 2018, 1:27 a.m. UTC | #37
Hi David,

On 22 Oct 2018, at 17:04, David Rientjes wrote:

> On Tue, 16 Oct 2018, Mel Gorman wrote:
>
>> I consider this to be an unfortunate outcome. On the one hand, we 
>> have a
>> problem that three people can trivially reproduce with known test 
>> cases
>> and a patch shown to resolve the problem. Two of those three people 
>> work
>> on distributions that are exposed to a large number of users. On the
>> other, we have a problem that requires the system to be in a specific
>> state and an unknown workload that suffers badly from the remote 
>> access
>> penalties with a patch that has review concerns and has not been 
>> proven
>> to resolve the trivial cases.
>
> The specific state is that remote memory is fragmented as well, this 
> is
> not atypical.  Removing __GFP_THISNODE to avoid thrashing a zone will 
> only
> be beneficial when you can allocate remotely instead.  When you cannot
> allocate remotely instead, you've made the problem much worse for
> something that should be __GFP_NORETRY in the first place (and was for
> years) and should never thrash.
>
> I'm not interested in patches that require remote nodes to have an
> abundance of free or unfragmented memory to avoid regressing.

I just wonder what is the page allocation priority list in your 
environment,
assuming all memory nodes are so fragmented that no huge pages can be
obtained without compaction or reclaim.

Here is my version of that list, please let me know if it makes sense to 
you:

1. local huge pages: with compaction and/or page reclaim, you are 
willing
to pay the penalty of getting huge pages;

2. local base pages: since, in your system, remote data accesses have 
much
higher penalty than the extra TLB misses incurred by the base page size;

3. remote huge pages: at least it is better than remote base pages;

4. remote base pages: it performs worst in terms of locality and TLBs.

This might not be easy to implement in current kernel, because
the zones from remote nodes will always be candidates when
kernel is trying get_page_from_freelist(). Only _GFP_THISNODE
and MPOL_BIND can eliminate these remote node zones, where _GFP_THISNODE
is a kernel version MPOL_BIND and overwrites any user space
memory policy other than MPOL_BIND, which is troublesome.

In addition, to prioritize local base pages over remote pages,
the original huge page allocation has to fail, then kernel can
fall back to base page allocations. And you will never get remote
huge pages any more if the local base page allocation fails,
because there is no way back to huge page allocation after the fallback.

Do you expect both behaviors?


>> In the case of distributions, the first
>> patch addresses concerns with a common workload where on the other 
>> hand
>> we have an internal workload of a single company that is affected --
>> which indirectly affects many users admittedly but only one entity 
>> directly.
>>
>
> The alternative, which is my patch, hasn't been tested or shown why it
> cannot work.  We continue to talk about order >= pageblock_order vs
> __GFP_COMPACTONLY.
>
> I'd like to know, specifically:
>
>  - what measurable affect my patch has that is better solved with 
> removing
>    __GFP_THISNODE on systems where remote memory is also fragmented?
>
>  - what platforms benefit from remote access to hugepages vs accessing
>    local small pages (I've asked this maybe 4 or 5 times now)?
>
>  - how is reclaiming (and possibly thrashing) memory helpful if 
> compaction
>    fails to free an entire pageblock due to slab fragmentation due to 
> low
>    on memory conditions and the page allocator preference to return 
> node-
>    local memory?
>
>  - how is reclaiming (and possibly thrashing) memory helpful if 
> compaction
>    cannot access the memory reclaimed because the freeing scanner has
>    already passed by it, or the migration scanner has passed by it, 
> since
>    this reclaim is not targeted to pages it can find?
>
>  - what metrics can be introduced to the page allocator so that we can
>    determine that reclaiming (and possibly thrashing) memory will 
> result
>    in a hugepage being allocated?

The slab fragmentation and whether reclaim/compaction can help form
huge pages seem to orthogonal to this patch, which tries to decide
the priority between locality and huge pages.

For slab fragmentation, you might find this paper “Making Huge Pages 
Actually Useful”
(https://dl.acm.org/citation.cfm?id=3173203) helpful. The paper is
trying to minimize the number of page blocks that have both moveable and
non-moveable pages.


--
Best Regards
Yan Zi
Mel Gorman Oct. 23, 2018, 7:57 a.m. UTC | #38
On Mon, Oct 22, 2018 at 02:04:32PM -0700, David Rientjes wrote:
> On Tue, 16 Oct 2018, Mel Gorman wrote:
> 
> > I consider this to be an unfortunate outcome. On the one hand, we have a
> > problem that three people can trivially reproduce with known test cases
> > and a patch shown to resolve the problem. Two of those three people work
> > on distributions that are exposed to a large number of users. On the
> > other, we have a problem that requires the system to be in a specific
> > state and an unknown workload that suffers badly from the remote access
> > penalties with a patch that has review concerns and has not been proven
> > to resolve the trivial cases.
> 
> The specific state is that remote memory is fragmented as well, this is 
> not atypical. 

While not necessarily atypical, *how* it gets fragmented is important.
The final state of fragmentation depends on both the input allocation
stream *and* the liveness of the mobility of unmovable pages. This is
why a target workload is crucial. While it's trivial to fragment memory,
it can be in a state that may or may not be trivially compacted.

For example, a fragmenting workload that interleaves slab allocations
with page cache may fragment memory but if the files are not being used
at the time of a THP allocation then they are trivially
reclaimed/compacted. However, if the files are in active use, the system
remains active.

You make the following claim in another response

	The test case is rather trivial: fragment all memory with order-4
	memory to replicate a fragmented local zone, use sched_setaffinity()
	to bind to that node, and fault a reasonable number of hugepages
	(128MB, 256, whatever).

You do not describe why it's order-4 allocations that are fragmenting or why
they remain live. order-4 might imply large jumbo frame allocations but they
typically do not remain live for very long unless you are using a driver
that recycles large numbers of order-4 pages. It's a critical detail. If
your test case is trivial then post it so there is a common base to work
from. A test case has been requested from you multiple times already.

Note that I accept it's trivial to fragment memory in a harmful way.
I've prototyped a test case yesterday that uses fio in the following way
to fragment memory

o fio of many small files (64K)
o create initial pages using writes that disable fallocate and create
  inodes on first open. This is massively inefficient from an IO
  perspective but it mixes slab and page cache allocations so all
  NUMA nodes get fragmented.
o Size the page cache so that it's 150% the size of memory so it forces
  reclaim activity and new fio activity to further mix slab and page
  cache allocations
o After initial write, run parallel readers to keep slab active and run
  this for the same length of time the initial writes took so fio has
  called stat() on the existing files and begun the read phase. This
  forces the slab and page cache pages to remain "live" and difficult
  to reclaim/compact.
o Finally, start a workload that allocates THP after the warmup phase
  but while fio is still runnning to measure allocation success rate
  and latencies

There are three configurations that use default advice, MADV_HUGEPAGE and
"always" fragment that is still running to cover the differet configurations
of interest. However, this is completly useless to you as even if this test
cases are fixed, there is no guarantee at all that it helps yours. They
are still being evaluated as they were recently prototyped but the mmtests
configurations in case someone wishes to independently evaluate are;

config-global-dhp__workload_thpfioscale
config-global-dhp__workload_thpfioscale-defrag
config-global-dhp__workload_thpfioscale-madvhugepage

It's best to run them on a dedicated test partition if possible. Locally
I've configured them to use a freshly created XFS filesystem.

If you're going to block fixes for problems that multiple people
experience then at least do the courtesy of providing a test case or
prototype patches for the complex alternatives you are proposing so they
can be independently evaluated. 

> Removing __GFP_THISNODE to avoid thrashing a zone will only 
> be beneficial when you can allocate remotely instead.  When you cannot 
> allocate remotely instead, you've made the problem much worse for 
> something that should be __GFP_NORETRY in the first place (and was for 
> years) and should never thrash.
> 
> I'm not interested in patches that require remote nodes to have an 
> abundance of free or unfragmented memory to avoid regressing.
> 

If only there was a test case that could reliably demonstrate this
independenly so it can be analysed and fixed *hint hint*. Better yet,
kindly prototype a fix.

> > In the case of distributions, the first
> > patch addresses concerns with a common workload where on the other hand
> > we have an internal workload of a single company that is affected --
> > which indirectly affects many users admittedly but only one entity directly.
> > 
> 
> The alternative, which is my patch, hasn't been tested or shown why it 
> cannot work.  We continue to talk about order >= pageblock_order vs
> __GFP_COMPACTONLY.
> 

You already received negative feedback on it. I worried that it would
conflate the requirements of THP and hugetlbfs with the latter usually
being more willing to tolerate initial latency to get the huge pages.
It's also being pointed out numerous times that the bug being addressed
is for a trivial workload on a clean system and not a case that depends
heavily on the system state. Normally the priority is to fix the trivial
case and work on the complex case.

Furthermore, Andrea has already stated that the complex alternatives will
take too long as they are non-trivial changes and that upstrewam cannot
be shipped (for RHEL presumably) with the existing bug. The bug being
addressed, and has a confirmed patch for, is addressing a severe regression
that "made important workloads unusuable". We are now in a situation where
distributions are likely to carry an out-of-tree patch to cover the trivial
case with a limited path forward as the blocking test case is unuavailable.

All that said, your patches results when tested were inconclusive at
best. For the trivial case, system CPU usage is indeed reduced to similar
levels to Michal's patch. However, the actual time to complete is 15.7%
longer than the vanilla kernel. Michal's patch completed 10.5% faster.

> I'd like to know, specifically:
> 
>  - what measurable affect my patch has that is better solved with removing
>    __GFP_THISNODE on systems where remote memory is also fragmented?
> 

In my case, it failed to fix the trivial case. That may be specific to
my machine or bad luck but nevertheless, it failed. The patch is also at
the RFC level with no test case that can be independently verified.

>  - what platforms benefit from remote access to hugepages vs accessing
>    local small pages (I've asked this maybe 4 or 5 times now)?
> 

That's a loaded question. In the case of a workload that has long
active phases that fit in L3 cache, the remote access penalty is masked.
Furthermore, virtualisation can benefit from THP even if remote due to
the reduced depth of page table walks combined with the fact that vcpus
may migrate to nodes accessing local memory due to either normal scheduler
waker/wakee migrations or automatic NUMA balancing.

>  - how is reclaiming (and possibly thrashing) memory helpful if compaction
>    fails to free an entire pageblock due to slab fragmentation due to low
>    on memory conditions and the page allocator preference to return node-
>    local memory?
> 

It is not, but it requires knowledge of the future to know if
reclaim/compaction will be truely beneficial. While it could be addressed
with tracking state and heuristics, no such state or heuristic has been
proposed. Furthermore, it's ortotogonal to the patch under discussion
and this is somewhat the crux of the matter. You are naking a fix for a
known trivial problem and proposing that people work on an unspecified,
unbounded, long-lived project as the basis for the nak without providing
even a prototype of what you propose.

>  - how is reclaiming (and possibly thrashing) memory helpful if compaction
>    cannot access the memory reclaimed because the freeing scanner has 
>    already passed by it, or the migration scanner has passed by it, since
>    this reclaim is not targeted to pages it can find?
> 

Again, knowledge of the future or a lot of memory scanning combined with
heuristics may be required to achieve what you propose.

>  - what metrics can be introduced to the page allocator so that we can
>    determine that reclaiming (and possibly thrashing) memory will result 
>    in a hugepage being allocated?
> 

Knowledge of the future to know as we cannot know in advance if reclaimn
will succeed without knowing if a workload is keeping contents of slab
and the LRU active.

> Until we have answers, especially for the last, there is no reason why thp 
> allocations should not be __GFP_NORETRY including for MADV_HUGEPAGE 
> regions.  The implementation of memory compaction simply cannot guarantee 
> that the cost is worthwhile.

Other than the fact that the __GFP_NORETRY patch failed at least one test
that is without a clear indication and changelog showing that it
addresses the problems with an unreleased workload.
Mel Gorman Oct. 23, 2018, 8:38 a.m. UTC | #39
On Tue, Oct 23, 2018 at 08:57:45AM +0100, Mel Gorman wrote:
> Note that I accept it's trivial to fragment memory in a harmful way.
> I've prototyped a test case yesterday that uses fio in the following way
> to fragment memory
> 
> o fio of many small files (64K)
> o create initial pages using writes that disable fallocate and create
>   inodes on first open. This is massively inefficient from an IO
>   perspective but it mixes slab and page cache allocations so all
>   NUMA nodes get fragmented.
> o Size the page cache so that it's 150% the size of memory so it forces
>   reclaim activity and new fio activity to further mix slab and page
>   cache allocations
> o After initial write, run parallel readers to keep slab active and run
>   this for the same length of time the initial writes took so fio has
>   called stat() on the existing files and begun the read phase. This
>   forces the slab and page cache pages to remain "live" and difficult
>   to reclaim/compact.
> o Finally, start a workload that allocates THP after the warmup phase
>   but while fio is still runnning to measure allocation success rate
>   and latencies
> 

The tests completed shortly after I wrote this mail so I can put some
figures to the intuitions expressed in this mail. I'm truncating the
reports for clarity but can upload the full data if necessary.

The target system is a 2-socket using E5-2670 v3 (Haswell). Base kernel
is 4.19. The baseline is an unpatched kernel. relaxthisnode-v1r1 is
patch 1 of Michal's series and does not include the second cleanup.
noretry-v1r1 is David's alternative

global-dhp__workload_usemem-stress-numa-compact
(no filesystem as this is the trivial case of allocating anonymous
 memory on a freshly booted system. Figures are elapsed time)

                                   4.19.0                 4.19.0                 4.19.0
                                  vanilla     relaxthisnode-v1r1           noretry-v1r1
Amean     System-1       14.16 (   0.00%)       12.35 *  12.75%*       15.96 * -12.70%*
Amean     System-3       15.14 (   0.00%)        9.83 *  35.08%*       11.00 *  27.34%*
Amean     System-4        9.88 (   0.00%)        9.85 (   0.25%)        9.80 (   0.75%)
Amean     Elapsd-1       29.23 (   0.00%)       26.16 *  10.50%*       33.81 * -15.70%*
Amean     Elapsd-3       25.67 (   0.00%)        7.28 *  71.63%*        8.49 *  66.93%*
Amean     Elapsd-4        5.49 (   0.00%)        5.53 (  -0.76%)        5.46 (   0.49%)

The figures in () are the percentage gain/loss. If it's around *'s then
the automation has guessed at the results are outside the noise.

System CPU usage is reduced by both as reported but Micha's gives a
10.5% gain and David's is a 15.7% loss. Boith appear to be outside the
noise. While not included here, the vanilla kernel swaps heavily with a 56%
reclaim efficiency (pages scanned vs pages reclaimed) and neither of the
proposed patches swaps and it's all from direct reclaim activity. Michal's
patch does not enter reclaim, David's enters reclaim but it's very light.

global-dhp__workload_thpfioscale-xfs
(Uses fio to fragment memory and keep slab and page cache active while
 there is an attempt to allocate THP in parallel. No special madvise
 flags or tuning is applied. A dedicated test partition is used for
 fio and XFS was the target filesystem that is recreated on every test)
thpfioscale Fault Latencies
                                       4.19.0                 4.19.0                 4.19.0
                                      vanilla     relaxthisnode-v1r1           noretry-v1r1
Amean     fault-base-5     1471.95 (   0.00%)     1515.64 (  -2.97%)     1491.05 (  -1.30%)
Amean     fault-huge-5        0.00 (   0.00%)      534.51 * -99.00%*        0.00 (   0.00%)

thpfioscale Percentage Faults Huge
                                  4.19.0                 4.19.0                 4.19.0
                                 vanilla     relaxthisnode-v1r1           noretry-v1r1
Percentage huge-5        0.00 (   0.00%)        1.18 ( 100.00%)        0.00 (   0.00%)

Both patches incur a slight hit to fault latency (measured in microseconds)
but it's well within the noise. While not included here, the variance is
massive (min 1052 microseconds, max 282348 microseconds in the vanilla
kernel. Both patches reduce the worst-case scenarios. All kernels show
terrible allocation success rates. Michal's had a 1.18% success rate but
that's probably luck.

global-dhp__workload_thpfioscale-madvhugepage-xfs
(Same as the last test but the THP allocation program uses
 MADV_HUGEPAGE)

thpfioscale Fault Latencies
                                       4.19.0                 4.19.0                 4.19.0
                                      vanilla     relaxthisnode-v1r1           noretry-v1r1
Amean     fault-base-5     6772.84 (   0.00%)    10256.30 * -51.43%*     1574.45 *  76.75%*
Amean     fault-huge-5     2644.19 (   0.00%)     5314.17 *-100.98%*     3517.89 ( -33.04%)

thpfioscale Percentage Faults Huge
                                  4.19.0                 4.19.0                 4.19.0
                                 vanilla     relaxthisnode-v1r1           noretry-v1r1
Percentage huge-5       45.48 (   0.00%)       95.09 ( 109.08%)        2.81 ( -93.81%

The first point of interest is that even with the vanilla kernel, the
allocation fault latency is much higher than average reflecting that
additional work is being done.

Next point of interest -- David's patch has much lower latency on
average when allocating *base* pages showing and the vmstats (not
included) show that compaction activity is reduced but not eliminated.

To balance this, Michal's patch has an 95% allocation success rate for THP
versus 45% on the default kernel at the cost of higher fault latency. This
is almost certainly a reflection that THPs are being allocated on remote
nodes. This can be considered good or bad depending on whether THP is
more important than locality. Note with David's patch that the allocation
success rate drops to 2.81% showing that it's much less efficient at THP.

This demonstrates a very clear trade-off between allocation latency and
allocation success rate for THP. Which one is better is workload
dependent.

global-dhp__workload_thpfioscale-defrag-xfs
(Same as global-dhp__workload_thpfioscale-xfs except that defrag is set
 to always)
thpfioscale Fault Latencies
                                       4.19.0                 4.19.0                 4.19.0
                                      vanilla     relaxthisnode-v1r1           noretry-v1r1
Amean     fault-base-5     2678.60 (   0.00%)     4442.14 * -65.84%*     1640.15 *  38.77%*
Amean     fault-huge-5     1324.61 (   0.00%)     1460.08 ( -10.23%)     2358.23 ( -78.03%)

thpfioscale Percentage Faults Huge
                                  4.19.0                 4.19.0                 4.19.0
                                 vanilla     relaxthisnode-v1r1           noretry-v1r1
Percentage huge-5        0.90 (   0.00%)        0.40 ( -55.56%)        0.22 ( -75.93%)

The allocation latency is again higher in this case as greater effort is
made to allocate the huge page. Michal's takes a hit as it's still
trying to allocate the THP while David's gives up early. In all cases
the allocation success rate is terrible.

So it should be reasonably clear that no approach is a universal win.
Michal's wins at the trivial case which is what the original problem
was and why it was pushed at all. David's in general has lower latency
in general because it gives up quickly but the allocation success rate
when MADV_HUGEPAGE specifically asks for huge pages is terrible. This
may make it a non-starter for the virtualisation case that wants huge
pages on the basis that if an application asks for huge pages, it
presumably is willing to pay the cost to get them.
David Rientjes Oct. 28, 2018, 9:45 p.m. UTC | #40
On Mon, 22 Oct 2018, Zi Yan wrote:

> Hi David,
> 

Hi!

> On 22 Oct 2018, at 17:04, David Rientjes wrote:
> 
> > On Tue, 16 Oct 2018, Mel Gorman wrote:
> > 
> > > I consider this to be an unfortunate outcome. On the one hand, we have a
> > > problem that three people can trivially reproduce with known test cases
> > > and a patch shown to resolve the problem. Two of those three people work
> > > on distributions that are exposed to a large number of users. On the
> > > other, we have a problem that requires the system to be in a specific
> > > state and an unknown workload that suffers badly from the remote access
> > > penalties with a patch that has review concerns and has not been proven
> > > to resolve the trivial cases.
> > 
> > The specific state is that remote memory is fragmented as well, this is
> > not atypical.  Removing __GFP_THISNODE to avoid thrashing a zone will only
> > be beneficial when you can allocate remotely instead.  When you cannot
> > allocate remotely instead, you've made the problem much worse for
> > something that should be __GFP_NORETRY in the first place (and was for
> > years) and should never thrash.
> > 
> > I'm not interested in patches that require remote nodes to have an
> > abundance of free or unfragmented memory to avoid regressing.
> 
> I just wonder what is the page allocation priority list in your environment,
> assuming all memory nodes are so fragmented that no huge pages can be
> obtained without compaction or reclaim.
> 
> Here is my version of that list, please let me know if it makes sense to you:
> 
> 1. local huge pages: with compaction and/or page reclaim, you are willing
> to pay the penalty of getting huge pages;
> 
> 2. local base pages: since, in your system, remote data accesses have much
> higher penalty than the extra TLB misses incurred by the base page size;
> 
> 3. remote huge pages: at least it is better than remote base pages;
> 
> 4. remote base pages: it performs worst in terms of locality and TLBs.
> 

I have a ton of different platforms available.  Consider a very basic 
access latency evaluation on Broadwell on a running production system: 
remote hugepage vs remote PAGE_SIZE pages had about 5% better access 
latency.  Remote PAGE_SIZE pages vs local pages is a 12% degradation.  On 
Naples, remote hugepage vs remote PAGE_SIZE had 2% better access latency 
intrasocket, no better access latency intersocket.  Remote PAGE_SIZE pages 
vs local is a 16% degradation intrasocket and 38% degradation intersocket.

My list removes (3) from your list, but is otherwise unchanged.  I remove 
(3) because 2-5% better access latency is nice, but we'd much rather fault 
local base pages and then let khugepaged collapse it into a local hugepage 
when fragmentation is improved or we have freed memory.  That is where we 
can get a much better result, 41% better access latency on Broadwell and 
52% better access latncy on Naples.  I wouldn't trade that for 2-5% 
immediate remote hugepages.

It just so happens that prior to this patch, the implementation of the 
page allocator matches this preference.

> In addition, to prioritize local base pages over remote pages,
> the original huge page allocation has to fail, then kernel can
> fall back to base page allocations. And you will never get remote
> huge pages any more if the local base page allocation fails,
> because there is no way back to huge page allocation after the fallback.
> 

That is exactly what we want, we want khugepaged to collapse memory into 
local hugepages for the big improvement rather than persistently access a 
hugepage remotely; the win of the remote hugepage just isn't substantial 
enough, and the win of the local hugepage is just too great.

> > I'd like to know, specifically:
> > 
> >  - what measurable affect my patch has that is better solved with removing
> >    __GFP_THISNODE on systems where remote memory is also fragmented?
> > 
> >  - what platforms benefit from remote access to hugepages vs accessing
> >    local small pages (I've asked this maybe 4 or 5 times now)?
> > 
> >  - how is reclaiming (and possibly thrashing) memory helpful if compaction
> >    fails to free an entire pageblock due to slab fragmentation due to low
> >    on memory conditions and the page allocator preference to return node-
> >    local memory?
> > 
> >  - how is reclaiming (and possibly thrashing) memory helpful if compaction
> >    cannot access the memory reclaimed because the freeing scanner has
> >    already passed by it, or the migration scanner has passed by it, since
> >    this reclaim is not targeted to pages it can find?
> > 
> >  - what metrics can be introduced to the page allocator so that we can
> >    determine that reclaiming (and possibly thrashing) memory will result
> >    in a hugepage being allocated?
> 
> The slab fragmentation and whether reclaim/compaction can help form
> huge pages seem to orthogonal to this patch, which tries to decide
> the priority between locality and huge pages.
> 

It's not orthogonal to the problem being reported which requires local 
memory pressure.  If there is no memory pressure, compaction often can 
succeed without reclaim because the freeing scanner can find target 
memory and the migration scanner can make a pageblock free.  Under memory 
pressure, however, where Andrea is experiencing the thrashing of the local 
node, by this time it can be inferred that slab pages have already fallen 
bcak to MIGRATE_MOVABLE pageblocks.  There is nothing preventing it under 
memory pressure because of the preference to return local memory over 
fragmenting pageblocks.

So the point of slab fragmentation, which typically exists locally when 
there is memory pressure, is that we cannot ascertain whether memory 
compaction even with reclaim will be successful.  Not only because the 
freeing scanner cannot access reclaimed memory, but also because we have 
no feedback from compaction to determine whether the work will be useful.  
Thrashing the local node, migrating COMPACT_CLUSTER_MAX pages, finding one 
slab page sitting in the pageblock, and continuing is not a good use of 
the allocator's time.  This is true of both MADV_HUGEPAGE and 
non-MADV_HUGEPAGE regions.

For reclaim to be considered, we should ensure that work is useful to 
compaction.  That ability is non-existant.  The worst case scenario is you 
thrash the local node and still cannot allocate a hugepage.
Education Directorate Oct. 29, 2018, 5:17 a.m. UTC | #41
On Tue, Sep 25, 2018 at 02:03:25PM +0200, Michal Hocko wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
> 
> THP allocation might be really disruptive when allocated on NUMA system
> with the local node full or hard to reclaim. Stefan has posted an
> allocation stall report on 4.12 based SLES kernel which suggests the
> same issue:
> 
> [245513.362669] kvm: page allocation stalls for 194572ms, order:9, mode:0x4740ca(__GFP_HIGHMEM|__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|__GFP_MOVABLE|__GFP_DIRECT_RECLAIM), nodemask=(null)
> [245513.363983] kvm cpuset=/ mems_allowed=0-1
> [245513.364604] CPU: 10 PID: 84752 Comm: kvm Tainted: G        W 4.12.0+98-ph <a href="/view.php?id=1" title="[geschlossen] Integration Ramdisk" class="resolved">0000001</a> SLE15 (unreleased)
> [245513.365258] Hardware name: Supermicro SYS-1029P-WTRT/X11DDW-NT, BIOS 2.0 12/05/2017
> [245513.365905] Call Trace:
> [245513.366535]  dump_stack+0x5c/0x84
> [245513.367148]  warn_alloc+0xe0/0x180
> [245513.367769]  __alloc_pages_slowpath+0x820/0xc90
> [245513.368406]  ? __slab_free+0xa9/0x2f0
> [245513.369048]  ? __slab_free+0xa9/0x2f0
> [245513.369671]  __alloc_pages_nodemask+0x1cc/0x210
> [245513.370300]  alloc_pages_vma+0x1e5/0x280
> [245513.370921]  do_huge_pmd_wp_page+0x83f/0xf00
> [245513.371554]  ? set_huge_zero_page.isra.52.part.53+0x9b/0xb0
> [245513.372184]  ? do_huge_pmd_anonymous_page+0x631/0x6d0
> [245513.372812]  __handle_mm_fault+0x93d/0x1060
> [245513.373439]  handle_mm_fault+0xc6/0x1b0
> [245513.374042]  __do_page_fault+0x230/0x430
> [245513.374679]  ? get_vtime_delta+0x13/0xb0
> [245513.375411]  do_page_fault+0x2a/0x70
> [245513.376145]  ? page_fault+0x65/0x80
> [245513.376882]  page_fault+0x7b/0x80
> [...]
> [245513.382056] Mem-Info:
> [245513.382634] active_anon:126315487 inactive_anon:1612476 isolated_anon:5
>                  active_file:60183 inactive_file:245285 isolated_file:0
>                  unevictable:15657 dirty:286 writeback:1 unstable:0
>                  slab_reclaimable:75543 slab_unreclaimable:2509111
>                  mapped:81814 shmem:31764 pagetables:370616 bounce:0
>                  free:32294031 free_pcp:6233 free_cma:0
> [245513.386615] Node 0 active_anon:254680388kB inactive_anon:1112760kB active_file:240648kB inactive_file:981168kB unevictable:13368kB isolated(anon):0kB isolated(file):0kB mapped:280240kB dirty:1144kB writeback:0kB shmem:95832kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 81225728kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> [245513.388650] Node 1 active_anon:250583072kB inactive_anon:5337144kB active_file:84kB inactive_file:0kB unevictable:49260kB isolated(anon):20kB isolated(file):0kB mapped:47016kB dirty:0kB writeback:4kB shmem:31224kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 31897600kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> 
> The defrag mode is "madvise" and from the above report it is clear that
> the THP has been allocated for MADV_HUGEPAGA vma.
> 
> Andrea has identified that the main source of the problem is
> __GFP_THISNODE usage:
> 
> : The problem is that direct compaction combined with the NUMA
> : __GFP_THISNODE logic in mempolicy.c is telling reclaim to swap very
> : hard the local node, instead of failing the allocation if there's no
> : THP available in the local node.
> :
> : Such logic was ok until __GFP_THISNODE was added to the THP allocation
> : path even with MPOL_DEFAULT.
> :
> : The idea behind the __GFP_THISNODE addition, is that it is better to
> : provide local memory in PAGE_SIZE units than to use remote NUMA THP
> : backed memory. That largely depends on the remote latency though, on
> : threadrippers for example the overhead is relatively low in my
> : experience.
> :
> : The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
> : extremely slow qemu startup with vfio, if the VM is larger than the
> : size of one host NUMA node. This is because it will try very hard to
> : unsuccessfully swapout get_user_pages pinned pages as result of the
> : __GFP_THISNODE being set, instead of falling back to PAGE_SIZE
> : allocations and instead of trying to allocate THP on other nodes (it
> : would be even worse without vfio type1 GUP pins of course, except it'd
> : be swapping heavily instead).
> 
> Fix this by removing __GFP_THISNODE for THP requests which are
> requesting the direct reclaim. This effectivelly reverts 5265047ac301 on
> the grounds that the zone/node reclaim was known to be disruptive due
> to premature reclaim when there was memory free. While it made sense at
> the time for HPC workloads without NUMA awareness on rare machines, it
> was ultimately harmful in the majority of cases. The existing behaviour
> is similiar, if not as widespare as it applies to a corner case but
> crucially, it cannot be tuned around like zone_reclaim_mode can. The
> default behaviour should always be to cause the least harm for the
> common case.
> 
> If there are specialised use cases out there that want zone_reclaim_mode
> in specific cases, then it can be built on top. Longterm we should
> consider a memory policy which allows for the node reclaim like behavior
> for the specific memory ranges which would allow a
> 
> [1] http://lkml.kernel.org/r/20180820032204.9591-1-aarcange@redhat.com
> 


I think we have a similar problem elsewhere too

I've run into cases where alloc_pool_huge_page() took forever looping
in reclaim via compaction_test. My tests and tracing eventually showed
that the root cause was we were looping in should_continue_reclaim()
due to __GFP_RETRY_MAYFAIL (set in alloc_fresh_huge_page()). The
scanned value was much lesser than sc->order. I have a small RFC
patch that I am testing and it seems good to so far, having said that
the issue is hard to reproduce and takes a while to hit.

I wonder if alloc_pool_huge_page() should also trim out it's logic
of __GFP_THISNODE for the same reasons as mentioned here. I like
that we round robin to alloc the pool pages, but __GFP_THISNODE
might be an overkill for that case as well.

Balbir Singh.
Michal Hocko Oct. 29, 2018, 9 a.m. UTC | #42
On Mon 29-10-18 16:17:52, Balbir Singh wrote:
[...]
> I wonder if alloc_pool_huge_page() should also trim out it's logic
> of __GFP_THISNODE for the same reasons as mentioned here. I like
> that we round robin to alloc the pool pages, but __GFP_THISNODE
> might be an overkill for that case as well.

alloc_pool_huge_page uses __GFP_THISNODE for a different reason than
THP. We really do want to allocated for a per-node pool. THP can
fallback or use a different node.

These hugetlb allocations might be disruptive and that is an expected
behavior because this is an explicit requirement from an admin to
pre-allocate large pages for the future use. __GFP_RETRY_MAYFAIL just
underlines that requirement.

Maybe the compaction logic could be improved and that might be a shared
goal with future changes though.
Education Directorate Oct. 29, 2018, 9:42 a.m. UTC | #43
On Mon, Oct 29, 2018 at 10:00:35AM +0100, Michal Hocko wrote:
> On Mon 29-10-18 16:17:52, Balbir Singh wrote:
> [...]
> > I wonder if alloc_pool_huge_page() should also trim out it's logic
> > of __GFP_THISNODE for the same reasons as mentioned here. I like
> > that we round robin to alloc the pool pages, but __GFP_THISNODE
> > might be an overkill for that case as well.
> 
> alloc_pool_huge_page uses __GFP_THISNODE for a different reason than
> THP. We really do want to allocated for a per-node pool. THP can
> fallback or use a different node.
> 

Not really

static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
...
        gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
...
        for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
                page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed);
                if (page)
                        break;
        }


The code just tries to distribute the pool

> These hugetlb allocations might be disruptive and that is an expected
> behavior because this is an explicit requirement from an admin to
> pre-allocate large pages for the future use. __GFP_RETRY_MAYFAIL just
> underlines that requirement.

Yes, but in the absence of a particular node, for example via sysctl
(as the compaction does), I don't think it is a hard requirement to get
a page from a particular node. I agree we need __GFP_RETRY_FAIL, in any
case the real root cause for me is should_reclaim_continue() which keeps
the task looping without making forward progress.

The __GFP_THISNODE was again an example of mis-leading the allocator
in this case, IMHO.

> 
> Maybe the compaction logic could be improved and that might be a shared
> goal with future changes though.

I'll also send my RFC once my testing is done, assuming I get it to reproduce
with a desired frequency.

Balbir Singh.
Michal Hocko Oct. 29, 2018, 10:08 a.m. UTC | #44
On Mon 29-10-18 20:42:53, Balbir Singh wrote:
> On Mon, Oct 29, 2018 at 10:00:35AM +0100, Michal Hocko wrote:
[...]
> > These hugetlb allocations might be disruptive and that is an expected
> > behavior because this is an explicit requirement from an admin to
> > pre-allocate large pages for the future use. __GFP_RETRY_MAYFAIL just
> > underlines that requirement.
> 
> Yes, but in the absence of a particular node, for example via sysctl
> (as the compaction does), I don't think it is a hard requirement to get
> a page from a particular node.

Again this seems like a deliberate decision. You want your distributions
as even as possible otherwise the NUMA placement will be much less
deterministic. At least that was the case for a long time. If you
have different per-node preferences, just use NUMA aware pre-allocation.

> I agree we need __GFP_RETRY_FAIL, in any
> case the real root cause for me is should_reclaim_continue() which keeps
> the task looping without making forward progress.

This seems like a separate issue which should better be debugged. Please
open a new thread describing the problem and the state of the node.
Andrea Arcangeli Oct. 29, 2018, 10:56 a.m. UTC | #45
Hello,

On Mon, Oct 29, 2018 at 11:08:34AM +0100, Michal Hocko wrote:
> This seems like a separate issue which should better be debugged. Please
> open a new thread describing the problem and the state of the node.

Yes, in my view it should be evaluated separately too, because it's
overall less concerning: __GFP_THISNODE there can only be set by the
root user there. So it has a chance to be legitimate behavior
there. Let's focus on solving the __GFP_THISNODE that any user in the
system can set (not only root) and cause severe and unexpected swap
storms or slowdowns to all other processes run by other users.

ls -l /sys/kernel/mm/hugepages/*/nr_hugepages

(and boot command line)

Thanks,
Andrea
diff mbox series

Patch

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index da858f794eb6..149b6f4cf023 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2046,8 +2046,36 @@  alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		nmask = policy_nodemask(gfp, pol);
 		if (!nmask || node_isset(hpage_node, *nmask)) {
 			mpol_cond_put(pol);
-			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
+			/*
+			 * We cannot invoke reclaim if __GFP_THISNODE
+			 * is set. Invoking reclaim with
+			 * __GFP_THISNODE set, would cause THP
+			 * allocations to trigger heavy swapping
+			 * despite there may be tons of free memory
+			 * (including potentially plenty of THP
+			 * already available in the buddy) on all the
+			 * other NUMA nodes.
+			 *
+			 * At most we could invoke compaction when
+			 * __GFP_THISNODE is set (but we would need to
+			 * refrain from invoking reclaim even if
+			 * compaction returned COMPACT_SKIPPED because
+			 * there wasn't not enough memory to succeed
+			 * compaction). For now just avoid
+			 * __GFP_THISNODE instead of limiting the
+			 * allocation path to a strict and single
+			 * compaction invocation.
+			 *
+			 * Supposedly if direct reclaim was enabled by
+			 * the caller, the app prefers THP regardless
+			 * of the node it comes from so this would be
+			 * more desiderable behavior than only
+			 * providing THP originated from the local
+			 * node in such case.
+			 */
+			if (!(gfp & __GFP_DIRECT_RECLAIM))
+				gfp |= __GFP_THISNODE;
+			page = __alloc_pages_node(hpage_node, gfp, order);
 			goto out;
 		}
 	}