Message ID | 20200616204527.19185-1-nigupta@nvidia.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v8] mm: Proactive compaction |
On Tue, 16 Jun 2020 13:45:27 -0700 Nitin Gupta <nigupta@nvidia.com> wrote:

> For some applications, we need to allocate almost all memory as
> hugepages. However, on a running system, higher-order allocations can
> fail if the memory is fragmented. Linux kernel currently does on-demand
> compaction as we request more hugepages, but this style of compaction
> incurs very high latency. Experiments with one-time full memory
> compaction (followed by hugepage allocations) show that kernel is able
> to restore a highly fragmented memory state to a fairly compacted memory
> state within <1 sec for a 32G system. Such data suggests that a more
> proactive compaction can help us allocate a large fraction of memory as
> hugepages keeping allocation latencies low.
>
> ...
>

All looks straightforward to me and easy to disable if it goes wrong.

All the hard-coded magic numbers are a worry, but such is life.

One teeny complaint:

>
> ...
>
> @@ -2650,12 +2801,34 @@ static int kcompactd(void *p)
>  		unsigned long pflags;
>
>  		trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
> -		wait_event_freezable(pgdat->kcompactd_wait,
> -				kcompactd_work_requested(pgdat));
> +		if (wait_event_freezable_timeout(pgdat->kcompactd_wait,
> +			kcompactd_work_requested(pgdat),
> +			msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC))) {
> +
> +			psi_memstall_enter(&pflags);
> +			kcompactd_do_work(pgdat);
> +			psi_memstall_leave(&pflags);
> +			continue;
> +		}
>
> -		psi_memstall_enter(&pflags);
> -		kcompactd_do_work(pgdat);
> -		psi_memstall_leave(&pflags);
> +		/* kcompactd wait timeout */
> +		if (should_proactive_compact_node(pgdat)) {
> +			unsigned int prev_score, score;

Everywhere else, scores have type `int'. Here they are unsigned. How come?

Would it be better to make these unsigned throughout? I don't think a
score can ever be negative?

> +			if (proactive_defer) {
> +				proactive_defer--;
> +				continue;
> +			}
> +			prev_score = fragmentation_score_node(pgdat);
> +			proactive_compact_node(pgdat);
> +			score = fragmentation_score_node(pgdat);
> +			/*
> +			 * Defer proactive compaction if the fragmentation
> +			 * score did not go down i.e. no progress made.
> +			 */
> +			proactive_defer = score < prev_score ?
> +					0 : 1 << COMPACT_MAX_DEFER_SHIFT;
> +		}
>  	}
On 6/17/20 1:53 PM, Andrew Morton wrote:
> On Tue, 16 Jun 2020 13:45:27 -0700 Nitin Gupta <nigupta@nvidia.com> wrote:
>
>> For some applications, we need to allocate almost all memory as
>> hugepages. However, on a running system, higher-order allocations can
>> fail if the memory is fragmented. Linux kernel currently does on-demand
>> compaction as we request more hugepages, but this style of compaction
>> incurs very high latency. Experiments with one-time full memory
>> compaction (followed by hugepage allocations) show that kernel is able
>> to restore a highly fragmented memory state to a fairly compacted memory
>> state within <1 sec for a 32G system. Such data suggests that a more
>> proactive compaction can help us allocate a large fraction of memory as
>> hugepages keeping allocation latencies low.
>>
>> ...
>>
>
> All looks straightforward to me and easy to disable if it goes wrong.
>
> All the hard-coded magic numbers are a worry, but such is life.
>
> One teeny complaint:
>
>>
>> ...
>>
>> @@ -2650,12 +2801,34 @@ static int kcompactd(void *p)
>>  		unsigned long pflags;
>>
>>  		trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
>> -		wait_event_freezable(pgdat->kcompactd_wait,
>> -				kcompactd_work_requested(pgdat));
>> +		if (wait_event_freezable_timeout(pgdat->kcompactd_wait,
>> +			kcompactd_work_requested(pgdat),
>> +			msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC))) {
>> +
>> +			psi_memstall_enter(&pflags);
>> +			kcompactd_do_work(pgdat);
>> +			psi_memstall_leave(&pflags);
>> +			continue;
>> +		}
>>
>> -		psi_memstall_enter(&pflags);
>> -		kcompactd_do_work(pgdat);
>> -		psi_memstall_leave(&pflags);
>> +		/* kcompactd wait timeout */
>> +		if (should_proactive_compact_node(pgdat)) {
>> +			unsigned int prev_score, score;
>
> Everywhere else, scores have type `int'. Here they are unsigned. How come?
>
> Would it be better to make these unsigned throughout? I don't think a
> score can ever be negative?
>

The score is always in [0, 100], so yes, it should be unsigned.
I will send another patch which fixes this.

Thanks,
Nitin
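The defer step in the quoted hunk can be read as a small pure function. The following is a minimal user-space sketch, not code from the patch: next_defer() and backoff_msec() are hypothetical helper names, COMPACT_MAX_DEFER_SHIFT is assumed to be 6, and FRAG_CHECK_INTERVAL_MSEC is a stand-in for HPAGE_FRAG_CHECK_INTERVAL_MSEC, assumed to be 500 ms from the wakeup interval given in the patch description.

```c
#include <assert.h>

#define COMPACT_MAX_DEFER_SHIFT  6    /* assumed value, not stated in this thread */
#define FRAG_CHECK_INTERVAL_MSEC 500  /* hypothetical stand-in for HPAGE_FRAG_CHECK_INTERVAL_MSEC */

/*
 * Number of kcompactd timeouts to skip before the next proactive
 * compaction attempt. Scores lie in [0, 100], so unsigned types suffice.
 */
static unsigned int next_defer(unsigned int prev_score, unsigned int score)
{
	assert(prev_score <= 100 && score <= 100);
	/* Back off only when the score did not go down, i.e. no progress. */
	return score < prev_score ? 0 : 1U << COMPACT_MAX_DEFER_SHIFT;
}

/* Approximate wall-clock length of a backoff of `defer` timeouts. */
static unsigned int backoff_msec(unsigned int defer)
{
	return defer * FRAG_CHECK_INTERVAL_MSEC;
}
```

Under these assumed values, a round that makes no progress skips the next 64 timeouts, which is about 32 seconds and in line with the "~30 seconds between retries" mentioned in the patch description.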
On Tue, Jun 16, 2020 at 01:45:27PM -0700, Nitin Gupta wrote:
> For some applications, we need to allocate almost all memory as
> hugepages. However, on a running system, higher-order allocations can
> fail if the memory is fragmented. Linux kernel currently does on-demand
> compaction as we request more hugepages, but this style of compaction
> incurs very high latency. Experiments with one-time full memory
> compaction (followed by hugepage allocations) show that kernel is able
> to restore a highly fragmented memory state to a fairly compacted memory
> state within <1 sec for a 32G system. Such data suggests that a more
> proactive compaction can help us allocate a large fraction of memory as
> hugepages keeping allocation latencies low.
>
> For a more proactive compaction, the approach taken here is to define a
> new sysctl called 'vm.compaction_proactiveness' which dictates bounds
> for external fragmentation which kcompactd tries to maintain.
>
> The tunable takes a value in range [0, 100], with a default of 20.
>
> Note that a previous version of this patch [1] was found to introduce
> too many tunables (per-order extfrag{low, high}), but this one reduces
> them to just one sysctl. Also, the new tunable is an opaque value
> instead of asking for specific bounds of "external fragmentation", which
> would have been difficult to estimate. The internal interpretation of
> this opaque value allows for future fine-tuning.
>
> Currently, we use a simple translation from this tunable to [low, high]
> "fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
> The score for a node is defined as weighted mean of per-zone external
> fragmentation. A zone's present_pages determines its weight.
>
> To periodically check per-node score, we reuse per-node kcompactd
> threads, which are woken up every 500 milliseconds to check the same. If
> a node's score exceeds its high threshold (as derived from user-provided
> proactiveness value), proactive compaction is started until its score
> reaches its low threshold value. By default, proactiveness is set to 20,
> which implies threshold values of low=80 and high=90.
>
> This patch is largely based on ideas from Michal Hocko [2]. See also the
> LWN article [3].
>
> Performance data
> ================
>
> System: x86_64, 1T RAM, 80 CPU threads.
> Kernel: 5.6.0-rc3 + this patch
>
> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
>
> Before starting the driver, the system was fragmented from a userspace
> program that allocates all memory and then for each 2M aligned section,
> frees 3/4 of base pages using munmap. The workload is mainly anonymous
> userspace pages, which are easy to move around. I intentionally avoided
> unmovable pages in this test to see how much latency we incur when
> hugepage allocations hit direct compaction.
>
> 1. Kernel hugepage allocation latencies
>
> With the system in such a fragmented state, a kernel driver then
> allocates as many hugepages as possible and measures allocation
> latency:
>
> (all latency values are in microseconds)
>
> - With vanilla 5.6.0-rc3
>
>   percentile latency
>   –––––––––– –––––––
>            5    7894
>           10    9496
>           25   12561
>           30   15295
>           40   18244
>           50   21229
>           60   27556
>           75   30147
>           80   31047
>           90   32859
>           95   33799
>
> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
> 762G total free => 98% of free memory could be allocated as hugepages)
>
> - With 5.6.0-rc3 + this patch, with proactiveness=20
>
> sysctl -w vm.compaction_proactiveness=20
>
>   percentile latency
>   –––––––––– –––––––
>            5       2
>           10       2
>           25       3
>           30       3
>           40       3
>           50       4
>           60       4
>           75       4
>           80       4
>           90       5
>           95     429
>
> Total 2M hugepages allocated = 384105 (750G worth of hugepages out of
> 762G total free => 98% of free memory could be allocated as hugepages)
>
> 2. JAVA heap allocation
>
> In this test, we first fragment memory using the same method as for (1).
>
> Then, we start a Java process with a heap size set to 700G and request
> the heap to be allocated with THP hugepages. We also set THP to madvise
> to allow hugepage backing of this heap.
>
> /usr/bin/time
>  java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
>
> The above command allocates 700G of Java heap using hugepages.
>
> - With vanilla 5.6.0-rc3
>
> 17.39user 1666.48system 27:37.89elapsed
>
> - With 5.6.0-rc3 + this patch, with proactiveness=20
>
> 8.35user 194.58system 3:19.62elapsed
>
> Elapsed time remains around 3:15, as proactiveness is further increased.
>
> Note that proactive compaction happens throughout the runtime of these
> workloads. The situation of one-time compaction, sufficient to supply
> hugepages for following allocation stream, can probably happen for more
> extreme proactiveness values, like 80 or 90.
>
> In the above Java workload, proactiveness is set to 20. The test starts
> with a node's score of 80 or higher, depending on the delay between the
> fragmentation step and starting the benchmark, which gives more-or-less
> time for the initial round of compaction. As the benchmark consumes
> hugepages, node's score quickly rises above the high threshold (90) and
> proactive compaction starts again, which brings down the score to the
> low threshold level (80). Repeat.
>
> bpftrace also confirms proactive compaction running 20+ times during the
> runtime of this Java benchmark. kcompactd threads consume 100% of one of
> the CPUs while it tries to bring a node's score within thresholds.
>
> Backoff behavior
> ================
>
> Above workloads produce a memory state which is easy to compact.
> However, if memory is filled with unmovable pages, proactive compaction
> should essentially back off. To test this aspect:
>
> - Created a kernel driver that allocates almost all memory as hugepages
>   followed by freeing first 3/4 of each hugepage.
> - Set proactiveness=40
> - Note that proactive_compact_node() is deferred maximum number of times
>   with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
>   (=> ~30 seconds between retries).
>
> [1] https://patchwork.kernel.org/patch/11098289/
> [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
> [3] https://lwn.net/Articles/817905/
>
> Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
> Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
> Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com>
> Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
> To: Andrew Morton <akpm@linux-foundation.org>
> CC: Vlastimil Babka <vbabka@suse.cz>
> CC: Khalid Aziz <khalid.aziz@oracle.com>
> CC: Michal Hocko <mhocko@suse.com>
> CC: Mel Gorman <mgorman@techsingularity.net>
> CC: Matthew Wilcox <willy@infradead.org>
> CC: Mike Kravetz <mike.kravetz@oracle.com>
> CC: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> CC: David Rientjes <rientjes@google.com>
> CC: Nitin Gupta <ngupta@nitingupta.dev>
> CC: Oleksandr Natalenko <oleksandr@redhat.com>
> CC: linux-kernel <linux-kernel@vger.kernel.org>
> CC: linux-mm <linux-mm@kvack.org>
> CC: Linux API <linux-api@vger.kernel.org>

This is now in -next and causes the following build failure:

$ make -skj"$(nproc)" ARCH=mips CROSS_COMPILE=mipsel-linux- O=out/mipsel distclean malta_kvm_guest_defconfig mm/compaction.o
In file included from include/linux/dev_printk.h:14,
                 from include/linux/device.h:15,
                 from include/linux/node.h:18,
                 from include/linux/cpu.h:17,
                 from mm/compaction.c:11:
In function 'fragmentation_score_zone',
    inlined from '__compact_finished' at mm/compaction.c:1982:11,
    inlined from 'compact_zone' at mm/compaction.c:2062:8:
include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |                                      ^
include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert'
  320 |    prefix ## suffix();    \
      |    ^~~~~~
include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert'
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |  ^~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
   39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
      |                                     ^~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
   59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
      |                     ^~~~~~~~~~~~~~~~
arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG'
   70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; })
      |                              ^~~~~~~~~
mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER'
   66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER
      |                                ^~~~~~~~~~~~~~~~~~
mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER'
 1898 |     extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
      |                            ^~~~~~~~~~~~~~~~~~~~~~
In function 'fragmentation_score_zone',
    inlined from 'kcompactd' at mm/compaction.c:1918:12:
include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |                                      ^
include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert'
  320 |    prefix ## suffix();    \
      |    ^~~~~~
include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert'
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |  ^~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
   39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
      |                                     ^~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
   59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
      |                     ^~~~~~~~~~~~~~~~
arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG'
   70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; })
      |                              ^~~~~~~~~
mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER'
   66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER
      |                                ^~~~~~~~~~~~~~~~~~
mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER'
 1898 |     extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
      |                            ^~~~~~~~~~~~~~~~~~~~~~
In function 'fragmentation_score_zone',
    inlined from 'kcompactd' at mm/compaction.c:1918:12:
include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |                                      ^
include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert'
  320 |    prefix ## suffix();    \
      |    ^~~~~~
include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert'
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |  ^~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
   39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
      |                                     ^~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
   59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
      |                     ^~~~~~~~~~~~~~~~
arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG'
   70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; })
      |                              ^~~~~~~~~
mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER'
   66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER
      |                                ^~~~~~~~~~~~~~~~~~
mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER'
 1898 |     extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
      |                            ^~~~~~~~~~~~~~~~~~~~~~
In function 'fragmentation_score_zone',
    inlined from 'kcompactd' at mm/compaction.c:1918:12:
include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |                                      ^
include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert'
  320 |    prefix ## suffix();    \
      |    ^~~~~~
include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert'
  339 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
      |  ^~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
   39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
      |                                     ^~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
   59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
      |                     ^~~~~~~~~~~~~~~~
arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG'
   70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; })
      |                              ^~~~~~~~~
mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER'
   66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER
      |                                ^~~~~~~~~~~~~~~~~~
mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER'
 1898 |     extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
      |                            ^~~~~~~~~~~~~~~~~~~~~~
make[3]: *** [scripts/Makefile.build:281: mm/compaction.o] Error 1
make[3]: Target '__build' not remade because of errors.
make[2]: *** [Makefile:1765: mm] Error 2
make[2]: Target 'mm/compaction.o' not remade because of errors.
make[1]: *** [Makefile:336: __build_one_by_one] Error 2
make[1]: Target 'distclean' not remade because of errors.
make[1]: Target 'malta_kvm_guest_defconfig' not remade because of errors.
make[1]: Target 'mm/compaction.o' not remade because of errors.
make: *** [Makefile:185: __sub-make] Error 2
make: Target 'distclean' not remade because of errors.
make: Target 'malta_kvm_guest_defconfig' not remade because of errors.
make: Target 'mm/compaction.o' not remade because of errors.

I am not sure why MIPS is special with its handling of hugepage support
but I am far from a MIPS expert :)

Cheers,
Nathan
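For reference, the threshold translation and the per-node score described in the patch text can be sketched in plain user-space C. This is only an illustration under stated assumptions, not the code in mm/compaction.c: struct zone_sample, wmark_low(), wmark_high() and node_score() are hypothetical names, "high=low+10%" is read as ten score points, and clamping the high threshold to 100 is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-zone inputs named in the description. */
struct zone_sample {
	unsigned long present_pages;  /* weight of the zone */
	unsigned int extfrag;         /* external fragmentation, 0..100 */
};

/* low = 100 - proactiveness, as stated in the patch description. */
static unsigned int wmark_low(unsigned int proactiveness)
{
	assert(proactiveness <= 100);
	return 100 - proactiveness;
}

/* high = low + 10; clamping to 100 is an assumption of this sketch. */
static unsigned int wmark_high(unsigned int proactiveness)
{
	unsigned int high = wmark_low(proactiveness) + 10;

	return high > 100 ? 100 : high;
}

/* Node score: mean of per-zone extfrag, weighted by present_pages. */
static unsigned int node_score(const struct zone_sample *zones, size_t nr)
{
	unsigned long long weighted = 0, total = 0;

	for (size_t i = 0; i < nr; i++) {
		weighted += (unsigned long long)zones[i].present_pages * zones[i].extfrag;
		total += zones[i].present_pages;
	}
	return total ? (unsigned int)(weighted / total) : 0;
}
```

With the default proactiveness of 20 this gives low=80 and high=90, matching the values quoted in the description.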
On 06/23/2020 10:26 AM, Nathan Chancellor wrote: > On Tue, Jun 16, 2020 at 01:45:27PM -0700, Nitin Gupta wrote: >> For some applications, we need to allocate almost all memory as >> hugepages. However, on a running system, higher-order allocations can >> fail if the memory is fragmented. Linux kernel currently does on-demand >> compaction as we request more hugepages, but this style of compaction >> incurs very high latency. Experiments with one-time full memory >> compaction (followed by hugepage allocations) show that kernel is able >> to restore a highly fragmented memory state to a fairly compacted memory >> state within <1 sec for a 32G system. Such data suggests that a more >> proactive compaction can help us allocate a large fraction of memory as >> hugepages keeping allocation latencies low. >> >> For a more proactive compaction, the approach taken here is to define a >> new sysctl called 'vm.compaction_proactiveness' which dictates bounds >> for external fragmentation which kcompactd tries to maintain. >> >> The tunable takes a value in range [0, 100], with a default of 20. >> >> Note that a previous version of this patch [1] was found to introduce >> too many tunables (per-order extfrag{low, high}), but this one reduces >> them to just one sysctl. Also, the new tunable is an opaque value >> instead of asking for specific bounds of "external fragmentation", which >> would have been difficult to estimate. The internal interpretation of >> this opaque value allows for future fine-tuning. >> >> Currently, we use a simple translation from this tunable to [low, high] >> "fragmentation score" thresholds (low=100-proactiveness, high=low+10%). >> The score for a node is defined as weighted mean of per-zone external >> fragmentation. A zone's present_pages determines its weight. >> >> To periodically check per-node score, we reuse per-node kcompactd >> threads, which are woken up every 500 milliseconds to check the same. 
If >> a node's score exceeds its high threshold (as derived from user-provided >> proactiveness value), proactive compaction is started until its score >> reaches its low threshold value. By default, proactiveness is set to 20, >> which implies threshold values of low=80 and high=90. >> >> This patch is largely based on ideas from Michal Hocko [2]. See also the >> LWN article [3]. >> >> Performance data >> ================ >> >> System: x64_64, 1T RAM, 80 CPU threads. >> Kernel: 5.6.0-rc3 + this patch >> >> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled >> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag >> >> Before starting the driver, the system was fragmented from a userspace >> program that allocates all memory and then for each 2M aligned section, >> frees 3/4 of base pages using munmap. The workload is mainly anonymous >> userspace pages, which are easy to move around. I intentionally avoided >> unmovable pages in this test to see how much latency we incur when >> hugepage allocations hit direct compaction. >> >> 1. 
Kernel hugepage allocation latencies >> >> With the system in such a fragmented state, a kernel driver then >> allocates as many hugepages as possible and measures allocation >> latency: >> >> (all latency values are in microseconds) >> >> - With vanilla 5.6.0-rc3 >> >> percentile latency >> –––––––––– ––––––– >> 5 7894 >> 10 9496 >> 25 12561 >> 30 15295 >> 40 18244 >> 50 21229 >> 60 27556 >> 75 30147 >> 80 31047 >> 90 32859 >> 95 33799 >> >> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of >> 762G total free => 98% of free memory could be allocated as hugepages) >> >> - With 5.6.0-rc3 + this patch, with proactiveness=20 >> >> sysctl -w vm.compaction_proactiveness=20 >> >> percentile latency >> –––––––––– ––––––– >> 5 2 >> 10 2 >> 25 3 >> 30 3 >> 40 3 >> 50 4 >> 60 4 >> 75 4 >> 80 4 >> 90 5 >> 95 429 >> >> Total 2M hugepages allocated = 384105 (750G worth of hugepages out of >> 762G total free => 98% of free memory could be allocated as hugepages) >> >> 2. JAVA heap allocation >> >> In this test, we first fragment memory using the same method as for (1). >> >> Then, we start a Java process with a heap size set to 700G and request >> the heap to be allocated with THP hugepages. We also set THP to madvise >> to allow hugepage backing of this heap. >> >> /usr/bin/time >> java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch >> >> The above command allocates 700G of Java heap using hugepages. >> >> - With vanilla 5.6.0-rc3 >> >> 17.39user 1666.48system 27:37.89elapsed >> >> - With 5.6.0-rc3 + this patch, with proactiveness=20 >> >> 8.35user 194.58system 3:19.62elapsed >> >> Elapsed time remains around 3:15, as proactiveness is further increased. >> >> Note that proactive compaction happens throughout the runtime of these >> workloads. The situation of one-time compaction, sufficient to supply >> hugepages for following allocation stream, can probably happen for more >> extreme proactiveness values, like 80 or 90. 
>> >> In the above Java workload, proactiveness is set to 20. The test starts >> with a node's score of 80 or higher, depending on the delay between the >> fragmentation step and starting the benchmark, which gives more-or-less >> time for the initial round of compaction. As t he benchmark consumes >> hugepages, node's score quickly rises above the high threshold (90) and >> proactive compaction starts again, which brings down the score to the >> low threshold level (80). Repeat. >> >> bpftrace also confirms proactive compaction running 20+ times during the >> runtime of this Java benchmark. kcompactd threads consume 100% of one of >> the CPUs while it tries to bring a node's score within thresholds. >> >> Backoff behavior >> ================ >> >> Above workloads produce a memory state which is easy to compact. >> However, if memory is filled with unmovable pages, proactive compaction >> should essentially back off. To test this aspect: >> >> - Created a kernel driver that allocates almost all memory as hugepages >> followed by freeing first 3/4 of each hugepage. >> - Set proactiveness=40 >> - Note that proactive_compact_node() is deferred maximum number of times >> with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check >> (=> ~30 seconds between retries). 
>> >> [1] https://patchwork.kernel.org/patch/11098289/ >> [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/ >> [3] https://lwn.net/Articles/817905/ >> >> Signed-off-by: Nitin Gupta <nigupta@nvidia.com> >> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> >> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> >> Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com> >> Tested-by: Oleksandr Natalenko <oleksandr@redhat.com> >> To: Andrew Morton <akpm@linux-foundation.org> >> CC: Vlastimil Babka <vbabka@suse.cz> >> CC: Khalid Aziz <khalid.aziz@oracle.com> >> CC: Michal Hocko <mhocko@suse.com> >> CC: Mel Gorman <mgorman@techsingularity.net> >> CC: Matthew Wilcox <willy@infradead.org> >> CC: Mike Kravetz <mike.kravetz@oracle.com> >> CC: Joonsoo Kim <iamjoonsoo.kim@lge.com> >> CC: David Rientjes <rientjes@google.com> >> CC: Nitin Gupta <ngupta@nitingupta.dev> >> CC: Oleksandr Natalenko <oleksandr@redhat.com> >> CC: linux-kernel <linux-kernel@vger.kernel.org> >> CC: linux-mm <linux-mm@kvack.org> >> CC: Linux API <linux-api@vger.kernel.org> > > This is now in -next and causes the following build failure: > > $ make -skj"$(nproc)" ARCH=mips CROSS_COMPILE=mipsel-linux- O=out/mipsel distclean malta_kvm_guest_defconfig mm/compaction.o > In file included from include/linux/dev_printk.h:14, > from include/linux/device.h:15, > from include/linux/node.h:18, > from include/linux/cpu.h:17, > from mm/compaction.c:11: > In function 'fragmentation_score_zone', > inlined from '__compact_finished' at mm/compaction.c:1982:11, > inlined from 'compact_zone' at mm/compaction.c:2062:8: > include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed > 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) > | ^ > include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert' > 320 | prefix ## suffix(); \ > | ^~~~~~ > include/linux/compiler.h:339:2: note: in expansion of 
macro '_compiletime_assert' > 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) > | ^~~~~~~~~~~~~~~~~~~ > include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert' > 39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) > | ^~~~~~~~~~~~~~~~~~ > include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG' > 59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed") > | ^~~~~~~~~~~~~~~~ > arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG' > 70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; }) > | ^~~~~~~~~ > mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER' > 66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER > | ^~~~~~~~~~~~~~~~~~ > mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER' > 1898 | extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); > | ^~~~~~~~~~~~~~~~~~~~~~ > In function 'fragmentation_score_zone', > inlined from 'kcompactd' at mm/compaction.c:1918:12: > include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed > 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) > | ^ > include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert' > 320 | prefix ## suffix(); \ > | ^~~~~~ > include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert' > 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) > | ^~~~~~~~~~~~~~~~~~~ > include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert' > 39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) > | ^~~~~~~~~~~~~~~~~~ > include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG' > 59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed") > | ^~~~~~~~~~~~~~~~ > arch/mips/include/asm/page.h:70:30: note: in expansion of macro 
'BUILD_BUG' > 70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; }) > | ^~~~~~~~~ > mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER' > 66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER > | ^~~~~~~~~~~~~~~~~~ > mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER' > 1898 | extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); > | ^~~~~~~~~~~~~~~~~~~~~~ > In function 'fragmentation_score_zone', > inlined from 'kcompactd' at mm/compaction.c:1918:12: > include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed > 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) > | ^ > include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert' > 320 | prefix ## suffix(); \ > | ^~~~~~ > include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert' > 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) > | ^~~~~~~~~~~~~~~~~~~ > include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert' > 39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) > | ^~~~~~~~~~~~~~~~~~ > include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG' > 59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed") > | ^~~~~~~~~~~~~~~~ > arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG' > 70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; }) > | ^~~~~~~~~ > mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER' > 66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER > | ^~~~~~~~~~~~~~~~~~ > mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER' > 1898 | extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); > | ^~~~~~~~~~~~~~~~~~~~~~ > In function 'fragmentation_score_zone', > inlined from 'kcompactd' at mm/compaction.c:1918:12: > include/linux/compiler.h:339:38: error: call to 
'__compiletime_assert_301' declared with attribute error: BUILD_BUG failed > 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) > | ^ > include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert' > 320 | prefix ## suffix(); \ > | ^~~~~~ > include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert' > 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) > | ^~~~~~~~~~~~~~~~~~~ > include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert' > 39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) > | ^~~~~~~~~~~~~~~~~~ > include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG' > 59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed") > | ^~~~~~~~~~~~~~~~ > arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG' > 70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; }) > | ^~~~~~~~~ > mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER' > 66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER > | ^~~~~~~~~~~~~~~~~~ > mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER' > 1898 | extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); > | ^~~~~~~~~~~~~~~~~~~~~~ > make[3]: *** [scripts/Makefile.build:281: mm/compaction.o] Error 1 > make[3]: Target '__build' not remade because of errors. > make[2]: *** [Makefile:1765: mm] Error 2 > make[2]: Target 'mm/compaction.o' not remade because of errors. > make[1]: *** [Makefile:336: __build_one_by_one] Error 2 > make[1]: Target 'distclean' not remade because of errors. > make[1]: Target 'malta_kvm_guest_defconfig' not remade because of errors. > make[1]: Target 'mm/compaction.o' not remade because of errors. > make: *** [Makefile:185: __sub-make] Error 2 > make: Target 'distclean' not remade because of errors. > make: Target 'malta_kvm_guest_defconfig' not remade because of errors. 
> make: Target 'mm/compaction.o' not remade because of errors.
>
> I am not sure why MIPS is special with its handling of hugepage support
> but I am far from a MIPS expert :)

It seems that both HUGETLB_PAGE and TRANSPARENT_HUGEPAGE are disabled with malta_kvm_guest_defconfig.

>
> Cheers,
> Nathan
>
On 6/22/20 7:26 PM, Nathan Chancellor wrote: > On Tue, Jun 16, 2020 at 01:45:27PM -0700, Nitin Gupta wrote: >> For some applications, we need to allocate almost all memory as >> hugepages. However, on a running system, higher-order allocations can >> fail if the memory is fragmented. Linux kernel currently does on-demand >> compaction as we request more hugepages, but this style of compaction >> incurs very high latency. Experiments with one-time full memory >> compaction (followed by hugepage allocations) show that kernel is able >> to restore a highly fragmented memory state to a fairly compacted memory >> state within <1 sec for a 32G system. Such data suggests that a more >> proactive compaction can help us allocate a large fraction of memory as >> hugepages keeping allocation latencies low. >> >> For a more proactive compaction, the approach taken here is to define a >> new sysctl called 'vm.compaction_proactiveness' which dictates bounds >> for external fragmentation which kcompactd tries to maintain. >> >> The tunable takes a value in range [0, 100], with a default of 20. >> >> Note that a previous version of this patch [1] was found to introduce >> too many tunables (per-order extfrag{low, high}), but this one reduces >> them to just one sysctl. Also, the new tunable is an opaque value >> instead of asking for specific bounds of "external fragmentation", which >> would have been difficult to estimate. The internal interpretation of >> this opaque value allows for future fine-tuning. >> >> Currently, we use a simple translation from this tunable to [low, high] >> "fragmentation score" thresholds (low=100-proactiveness, high=low+10%). >> The score for a node is defined as weighted mean of per-zone external >> fragmentation. A zone's present_pages determines its weight. >> >> To periodically check per-node score, we reuse per-node kcompactd >> threads, which are woken up every 500 milliseconds to check the same. 
If >> a node's score exceeds its high threshold (as derived from user-provided >> proactiveness value), proactive compaction is started until its score >> reaches its low threshold value. By default, proactiveness is set to 20, >> which implies threshold values of low=80 and high=90. >> >> This patch is largely based on ideas from Michal Hocko [2]. See also the >> LWN article [3]. >> >> Performance data >> ================ >> >> System: x86_64, 1T RAM, 80 CPU threads. >> Kernel: 5.6.0-rc3 + this patch >> >> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled >> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag >> >> Before starting the driver, the system was fragmented from a userspace >> program that allocates all memory and then for each 2M aligned section, >> frees 3/4 of base pages using munmap. The workload is mainly anonymous >> userspace pages, which are easy to move around. I intentionally avoided >> unmovable pages in this test to see how much latency we incur when >> hugepage allocations hit direct compaction. >> >> 1. 
Kernel hugepage allocation latencies >> >> With the system in such a fragmented state, a kernel driver then >> allocates as many hugepages as possible and measures allocation >> latency: >> >> (all latency values are in microseconds) >> >> - With vanilla 5.6.0-rc3 >> >> percentile latency >> –––––––––– ––––––– >> 5 7894 >> 10 9496 >> 25 12561 >> 30 15295 >> 40 18244 >> 50 21229 >> 60 27556 >> 75 30147 >> 80 31047 >> 90 32859 >> 95 33799 >> >> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of >> 762G total free => 98% of free memory could be allocated as hugepages) >> >> - With 5.6.0-rc3 + this patch, with proactiveness=20 >> >> sysctl -w vm.compaction_proactiveness=20 >> >> percentile latency >> –––––––––– ––––––– >> 5 2 >> 10 2 >> 25 3 >> 30 3 >> 40 3 >> 50 4 >> 60 4 >> 75 4 >> 80 4 >> 90 5 >> 95 429 >> >> Total 2M hugepages allocated = 384105 (750G worth of hugepages out of >> 762G total free => 98% of free memory could be allocated as hugepages) >> >> 2. JAVA heap allocation >> >> In this test, we first fragment memory using the same method as for (1). >> >> Then, we start a Java process with a heap size set to 700G and request >> the heap to be allocated with THP hugepages. We also set THP to madvise >> to allow hugepage backing of this heap. >> >> /usr/bin/time >> java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch >> >> The above command allocates 700G of Java heap using hugepages. >> >> - With vanilla 5.6.0-rc3 >> >> 17.39user 1666.48system 27:37.89elapsed >> >> - With 5.6.0-rc3 + this patch, with proactiveness=20 >> >> 8.35user 194.58system 3:19.62elapsed >> >> Elapsed time remains around 3:15, as proactiveness is further increased. >> >> Note that proactive compaction happens throughout the runtime of these >> workloads. The situation of one-time compaction, sufficient to supply >> hugepages for following allocation stream, can probably happen for more >> extreme proactiveness values, like 80 or 90. 
>> >> In the above Java workload, proactiveness is set to 20. The test starts >> with a node's score of 80 or higher, depending on the delay between the >> fragmentation step and starting the benchmark, which gives more-or-less >> time for the initial round of compaction. As the benchmark consumes >> hugepages, node's score quickly rises above the high threshold (90) and >> proactive compaction starts again, which brings down the score to the >> low threshold level (80). Repeat. >> >> bpftrace also confirms proactive compaction running 20+ times during the >> runtime of this Java benchmark. kcompactd threads consume 100% of one of >> the CPUs while it tries to bring a node's score within thresholds. >> >> Backoff behavior >> ================ >> >> Above workloads produce a memory state which is easy to compact. >> However, if memory is filled with unmovable pages, proactive compaction >> should essentially back off. To test this aspect: >> >> - Created a kernel driver that allocates almost all memory as hugepages >> followed by freeing first 3/4 of each hugepage. >> - Set proactiveness=40 >> - Note that proactive_compact_node() is deferred maximum number of times >> with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check >> (=> ~30 seconds between retries). 
>> >> [1] https://patchwork.kernel.org/patch/11098289/ >> [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/ >> [3] https://lwn.net/Articles/817905/ >> >> Signed-off-by: Nitin Gupta <nigupta@nvidia.com> >> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> >> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> >> Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com> >> Tested-by: Oleksandr Natalenko <oleksandr@redhat.com> >> To: Andrew Morton <akpm@linux-foundation.org> >> CC: Vlastimil Babka <vbabka@suse.cz> >> CC: Khalid Aziz <khalid.aziz@oracle.com> >> CC: Michal Hocko <mhocko@suse.com> >> CC: Mel Gorman <mgorman@techsingularity.net> >> CC: Matthew Wilcox <willy@infradead.org> >> CC: Mike Kravetz <mike.kravetz@oracle.com> >> CC: Joonsoo Kim <iamjoonsoo.kim@lge.com> >> CC: David Rientjes <rientjes@google.com> >> CC: Nitin Gupta <ngupta@nitingupta.dev> >> CC: Oleksandr Natalenko <oleksandr@redhat.com> >> CC: linux-kernel <linux-kernel@vger.kernel.org> >> CC: linux-mm <linux-mm@kvack.org> >> CC: Linux API <linux-api@vger.kernel.org> > > This is now in -next and causes the following build failure: > > $ make -skj"$(nproc)" ARCH=mips CROSS_COMPILE=mipsel-linux- O=out/mipsel distclean malta_kvm_guest_defconfig mm/compaction.o > In file included from include/linux/dev_printk.h:14, > from include/linux/device.h:15, > from include/linux/node.h:18, > from include/linux/cpu.h:17, > from mm/compaction.c:11: > In function 'fragmentation_score_zone', > inlined from '__compact_finished' at mm/compaction.c:1982:11, > inlined from 'compact_zone' at mm/compaction.c:2062:8: > include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed > 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) > | ^ > include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert' > 320 | prefix ## suffix(); \ > | ^~~~~~ > include/linux/compiler.h:339:2: note: in expansion of 
macro '_compiletime_assert' > 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) > | ^~~~~~~~~~~~~~~~~~~ > include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert' > 39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) > | ^~~~~~~~~~~~~~~~~~ > include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG' > 59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed") > | ^~~~~~~~~~~~~~~~ > arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG' > 70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; }) > | ^~~~~~~~~ > mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER' > 66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER > | ^~~~~~~~~~~~~~~~~~ > mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER' > 1898 | extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); > | ^~~~~~~~~~~~~~~~~~~~~~ > In function 'fragmentation_score_zone', > inlined from 'kcompactd' at mm/compaction.c:1918:12: > include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed > 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) > | ^ > include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert' > 320 | prefix ## suffix(); \ > | ^~~~~~ > include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert' > 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) > | ^~~~~~~~~~~~~~~~~~~ > include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert' > 39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) > | ^~~~~~~~~~~~~~~~~~ > include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG' > 59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed") > | ^~~~~~~~~~~~~~~~ > arch/mips/include/asm/page.h:70:30: note: in expansion of macro 
'BUILD_BUG' > 70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; }) > | ^~~~~~~~~ > mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER' > 66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER > | ^~~~~~~~~~~~~~~~~~ > mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER' > 1898 | extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); > | ^~~~~~~~~~~~~~~~~~~~~~ > In function 'fragmentation_score_zone', > inlined from 'kcompactd' at mm/compaction.c:1918:12: > include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed > 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) > | ^ > include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert' > 320 | prefix ## suffix(); \ > | ^~~~~~ > include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert' > 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) > | ^~~~~~~~~~~~~~~~~~~ > include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert' > 39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) > | ^~~~~~~~~~~~~~~~~~ > include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG' > 59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed") > | ^~~~~~~~~~~~~~~~ > arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG' > 70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; }) > | ^~~~~~~~~ > mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER' > 66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER > | ^~~~~~~~~~~~~~~~~~ > mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER' > 1898 | extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); > | ^~~~~~~~~~~~~~~~~~~~~~ > In function 'fragmentation_score_zone', > inlined from 'kcompactd' at mm/compaction.c:1918:12: > include/linux/compiler.h:339:38: error: call to 
'__compiletime_assert_301' declared with attribute error: BUILD_BUG failed > 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) > | ^ > include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert' > 320 | prefix ## suffix(); \ > | ^~~~~~ > include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert' > 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) > | ^~~~~~~~~~~~~~~~~~~ > include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert' > 39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) > | ^~~~~~~~~~~~~~~~~~ > include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG' > 59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed") > | ^~~~~~~~~~~~~~~~ > arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG' > 70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; }) > | ^~~~~~~~~ > mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER' > 66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER > | ^~~~~~~~~~~~~~~~~~ > mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER' > 1898 | extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); > | ^~~~~~~~~~~~~~~~~~~~~~ > make[3]: *** [scripts/Makefile.build:281: mm/compaction.o] Error 1 > make[3]: Target '__build' not remade because of errors. > make[2]: *** [Makefile:1765: mm] Error 2 > make[2]: Target 'mm/compaction.o' not remade because of errors. > make[1]: *** [Makefile:336: __build_one_by_one] Error 2 > make[1]: Target 'distclean' not remade because of errors. > make[1]: Target 'malta_kvm_guest_defconfig' not remade because of errors. > make[1]: Target 'mm/compaction.o' not remade because of errors. > make: *** [Makefile:185: __sub-make] Error 2 > make: Target 'distclean' not remade because of errors. > make: Target 'malta_kvm_guest_defconfig' not remade because of errors. 
> make: Target 'mm/compaction.o' not remade because of errors.
>
> I am not sure why MIPS is special with its handling of hugepage support
> but I am far from a MIPS expert :)
>

Can you check if this patch fixes the compile error:

diff --git a/mm/compaction.c b/mm/compaction.c
index 45fd24a0ea0b..02963ffb9e70 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -62,7 +62,7 @@ static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
  */
 #if defined CONFIG_TRANSPARENT_HUGEPAGE
 #define COMPACTION_HPAGE_ORDER HPAGE_PMD_ORDER
-#elif defined HUGETLB_PAGE_ORDER
+#elif defined CONFIG_HUGETLBFS
 #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER
 #else
 #define COMPACTION_HPAGE_ORDER (PMD_SHIFT - PAGE_SHIFT)
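For anyone following along, here is a small userspace sketch of what the failure appears to be, going only by the error log quoted above. The log shows arch/mips/include/asm/page.h defining HUGETLB_PAGE_ORDER as ({BUILD_BUG(); 0; }), so a `defined HUGETLB_PAGE_ORDER` test presumably passes even when hugetlb support is configured out, and the stub then gets expanded in fragmentation_score_zone(). Testing the config option instead should skip the stub. All DEMO_* names below are hypothetical stand-ins, not kernel symbols, and the fallback value 9 is an arbitrary illustrative number.

```c
#include <stdio.h>

/*
 * DEMO_CONFIG_HUGETLB is deliberately left undefined. It stands in for a
 * configuration where hugetlb support is disabled (an assumption made for
 * this sketch, mirroring malta_kvm_guest_defconfig as reported above).
 */

#ifdef DEMO_CONFIG_HUGETLB
#define DEMO_HUGETLB_ORDER 9
#else
/*
 * Stand-in for the MIPS-style stub: the macro is defined, but expanding it
 * would break the build because demo_build_bug() is never declared.
 */
#define DEMO_HUGETLB_ORDER (demo_build_bug(), 0)
#endif

/* Arbitrary illustrative fallback, playing the role of (PMD_SHIFT - PAGE_SHIFT). */
#define DEMO_FALLBACK_ORDER 9

/* Old-style check: true whenever the macro exists, even if it is only the stub. */
#if defined DEMO_HUGETLB_ORDER
#define DEMO_OLD_CHECK_PICKS_HUGETLB 1
#else
#define DEMO_OLD_CHECK_PICKS_HUGETLB 0
#endif

/* New-style check: true only when the feature itself is configured in. */
#if defined DEMO_CONFIG_HUGETLB
#define DEMO_NEW_CHECK_PICKS_HUGETLB 1
#define DEMO_ORDER DEMO_HUGETLB_ORDER
#else
#define DEMO_NEW_CHECK_PICKS_HUGETLB 0
#define DEMO_ORDER DEMO_FALLBACK_ORDER
#endif

/* 1 if the macro-based check would select (and later expand) the stub. */
int demo_old_check_picks_hugetlb(void)
{
	return DEMO_OLD_CHECK_PICKS_HUGETLB;
}

/* 1 if the config-based check would select the hugetlb order. */
int demo_new_check_picks_hugetlb(void)
{
	return DEMO_NEW_CHECK_PICKS_HUGETLB;
}

/* Order chosen by the config-based check; the stub is never expanded here. */
int demo_order(void)
{
	return DEMO_ORDER;
}

/* Print which branch each style of check ends up taking. */
void demo_report(void)
{
	printf("macro-based check picks hugetlb order:  %d\n",
	       demo_old_check_picks_hugetlb());
	printf("config-based check picks hugetlb order: %d\n",
	       demo_new_check_picks_hugetlb());
	printf("order used with config-based check:     %d\n", demo_order());
}
```

Calling demo_report() from a main() shows the two checks disagreeing when the feature is off. Whether CONFIG_HUGETLBFS is the right symbol to key on for every architecture is a separate question that this sketch does not answer.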
On Mon, Jun 22, 2020 at 09:32:12PM -0700, Nitin Gupta wrote:
> On 6/22/20 7:26 PM, Nathan Chancellor wrote:
> > On Tue, Jun 16, 2020 at 01:45:27PM -0700, Nitin Gupta wrote:
> >
> > ...
> >
> > This is now in -next and causes the following build failure:
> >
> > ...
> >
> > I am not sure why MIPS is special with its handling of hugepage support
> > but I am far from a MIPS expert :)
> >
>
> Can you check if this patch fixes the compile error:
>
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 45fd24a0ea0b..02963ffb9e70 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -62,7 +62,7 @@ static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
>   */
>  #if defined CONFIG_TRANSPARENT_HUGEPAGE
>  #define COMPACTION_HPAGE_ORDER HPAGE_PMD_ORDER
> -#elif defined HUGETLB_PAGE_ORDER
> +#elif defined CONFIG_HUGETLBFS
>  #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER
>  #else
>  #define COMPACTION_HPAGE_ORDER (PMD_SHIFT - PAGE_SHIFT)
>

Tested-by: Nathan Chancellor <natechancellor@gmail.com> # build
On 6/22/20 9:57 PM, Nathan Chancellor wrote: > On Mon, Jun 22, 2020 at 09:32:12PM -0700, Nitin Gupta wrote: >> On 6/22/20 7:26 PM, Nathan Chancellor wrote: >>> On Tue, Jun 16, 2020 at 01:45:27PM -0700, Nitin Gupta wrote: >>>> For some applications, we need to allocate almost all memory as >>>> hugepages. However, on a running system, higher-order allocations can >>>> fail if the memory is fragmented. Linux kernel currently does on-demand >>>> compaction as we request more hugepages, but this style of compaction >>>> incurs very high latency. Experiments with one-time full memory >>>> compaction (followed by hugepage allocations) show that kernel is able >>>> to restore a highly fragmented memory state to a fairly compacted memory >>>> state within <1 sec for a 32G system. Such data suggests that a more >>>> proactive compaction can help us allocate a large fraction of memory as >>>> hugepages keeping allocation latencies low. >>>> >>>> For a more proactive compaction, the approach taken here is to define a >>>> new sysctl called 'vm.compaction_proactiveness' which dictates bounds >>>> for external fragmentation which kcompactd tries to maintain. >>>> >>>> The tunable takes a value in range [0, 100], with a default of 20. >>>> >>>> Note that a previous version of this patch [1] was found to introduce >>>> too many tunables (per-order extfrag{low, high}), but this one reduces >>>> them to just one sysctl. Also, the new tunable is an opaque value >>>> instead of asking for specific bounds of "external fragmentation", which >>>> would have been difficult to estimate. The internal interpretation of >>>> this opaque value allows for future fine-tuning. >>>> >>>> Currently, we use a simple translation from this tunable to [low, high] >>>> "fragmentation score" thresholds (low=100-proactiveness, high=low+10%). >>>> The score for a node is defined as weighted mean of per-zone external >>>> fragmentation. A zone's present_pages determines its weight. 
>>>> >>>> To periodically check per-node score, we reuse per-node kcompactd >>>> threads, which are woken up every 500 milliseconds to check the same. If >>>> a node's score exceeds its high threshold (as derived from user-provided >>>> proactiveness value), proactive compaction is started until its score >>>> reaches its low threshold value. By default, proactiveness is set to 20, >>>> which implies threshold values of low=80 and high=90. >>>> >>>> This patch is largely based on ideas from Michal Hocko [2]. See also the >>>> LWN article [3]. >>>> >>>> Performance data >>>> ================ >>>> >>>> System: x64_64, 1T RAM, 80 CPU threads. >>>> Kernel: 5.6.0-rc3 + this patch >>>> >>>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled >>>> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag >>>> >>>> Before starting the driver, the system was fragmented from a userspace >>>> program that allocates all memory and then for each 2M aligned section, >>>> frees 3/4 of base pages using munmap. The workload is mainly anonymous >>>> userspace pages, which are easy to move around. I intentionally avoided >>>> unmovable pages in this test to see how much latency we incur when >>>> hugepage allocations hit direct compaction. >>>> >>>> 1. 
Kernel hugepage allocation latencies >>>> >>>> With the system in such a fragmented state, a kernel driver then >>>> allocates as many hugepages as possible and measures allocation >>>> latency: >>>> >>>> (all latency values are in microseconds) >>>> >>>> - With vanilla 5.6.0-rc3 >>>> >>>> percentile latency >>>> –––––––––– ––––––– >>>> 5 7894 >>>> 10 9496 >>>> 25 12561 >>>> 30 15295 >>>> 40 18244 >>>> 50 21229 >>>> 60 27556 >>>> 75 30147 >>>> 80 31047 >>>> 90 32859 >>>> 95 33799 >>>> >>>> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of >>>> 762G total free => 98% of free memory could be allocated as hugepages) >>>> >>>> - With 5.6.0-rc3 + this patch, with proactiveness=20 >>>> >>>> sysctl -w vm.compaction_proactiveness=20 >>>> >>>> percentile latency >>>> –––––––––– ––––––– >>>> 5 2 >>>> 10 2 >>>> 25 3 >>>> 30 3 >>>> 40 3 >>>> 50 4 >>>> 60 4 >>>> 75 4 >>>> 80 4 >>>> 90 5 >>>> 95 429 >>>> >>>> Total 2M hugepages allocated = 384105 (750G worth of hugepages out of >>>> 762G total free => 98% of free memory could be allocated as hugepages) >>>> >>>> 2. JAVA heap allocation >>>> >>>> In this test, we first fragment memory using the same method as for (1). >>>> >>>> Then, we start a Java process with a heap size set to 700G and request >>>> the heap to be allocated with THP hugepages. We also set THP to madvise >>>> to allow hugepage backing of this heap. >>>> >>>> /usr/bin/time >>>> java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch >>>> >>>> The above command allocates 700G of Java heap using hugepages. >>>> >>>> - With vanilla 5.6.0-rc3 >>>> >>>> 17.39user 1666.48system 27:37.89elapsed >>>> >>>> - With 5.6.0-rc3 + this patch, with proactiveness=20 >>>> >>>> 8.35user 194.58system 3:19.62elapsed >>>> >>>> Elapsed time remains around 3:15, as proactiveness is further increased. >>>> >>>> Note that proactive compaction happens throughout the runtime of these >>>> workloads. 
The situation of one-time compaction, sufficient to supply >>>> hugepages for following allocation stream, can probably happen for more >>>> extreme proactiveness values, like 80 or 90. >>>> >>>> In the above Java workload, proactiveness is set to 20. The test starts >>>> with a node's score of 80 or higher, depending on the delay between the >>>> fragmentation step and starting the benchmark, which gives more-or-less >>>> time for the initial round of compaction. As the benchmark consumes >>>> hugepages, node's score quickly rises above the high threshold (90) and >>>> proactive compaction starts again, which brings down the score to the >>>> low threshold level (80). Repeat. >>>> >>>> bpftrace also confirms proactive compaction running 20+ times during the >>>> runtime of this Java benchmark. kcompactd threads consume 100% of one of >>>> the CPUs while it tries to bring a node's score within thresholds. >>>> >>>> Backoff behavior >>>> ================ >>>> >>>> Above workloads produce a memory state which is easy to compact. >>>> However, if memory is filled with unmovable pages, proactive compaction >>>> should essentially back off. To test this aspect: >>>> >>>> - Created a kernel driver that allocates almost all memory as hugepages >>>> followed by freeing first 3/4 of each hugepage. >>>> - Set proactiveness=40 >>>> - Note that proactive_compact_node() is deferred maximum number of times >>>> with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check >>>> (=> ~30 seconds between retries).
>>>> >>>> [1] https://patchwork.kernel.org/patch/11098289/ >>>> [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/ >>>> [3] https://lwn.net/Articles/817905/ >>>> >>>> Signed-off-by: Nitin Gupta <nigupta@nvidia.com> >>>> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> >>>> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> >>>> Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com> >>>> Tested-by: Oleksandr Natalenko <oleksandr@redhat.com> >>>> To: Andrew Morton <akpm@linux-foundation.org> >>>> CC: Vlastimil Babka <vbabka@suse.cz> >>>> CC: Khalid Aziz <khalid.aziz@oracle.com> >>>> CC: Michal Hocko <mhocko@suse.com> >>>> CC: Mel Gorman <mgorman@techsingularity.net> >>>> CC: Matthew Wilcox <willy@infradead.org> >>>> CC: Mike Kravetz <mike.kravetz@oracle.com> >>>> CC: Joonsoo Kim <iamjoonsoo.kim@lge.com> >>>> CC: David Rientjes <rientjes@google.com> >>>> CC: Nitin Gupta <ngupta@nitingupta.dev> >>>> CC: Oleksandr Natalenko <oleksandr@redhat.com> >>>> CC: linux-kernel <linux-kernel@vger.kernel.org> >>>> CC: linux-mm <linux-mm@kvack.org> >>>> CC: Linux API <linux-api@vger.kernel.org> >>> >>> This is now in -next and causes the following build failure: >>> >>> $ make -skj"$(nproc)" ARCH=mips CROSS_COMPILE=mipsel-linux- O=out/mipsel distclean malta_kvm_guest_defconfig mm/compaction.o >>> In file included from include/linux/dev_printk.h:14, >>> from include/linux/device.h:15, >>> from include/linux/node.h:18, >>> from include/linux/cpu.h:17, >>> from mm/compaction.c:11: >>> In function 'fragmentation_score_zone', >>> inlined from '__compact_finished' at mm/compaction.c:1982:11, >>> inlined from 'compact_zone' at mm/compaction.c:2062:8: >>> include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed >>> 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) >>> | ^ >>> include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert' >>> 320 | prefix 
## suffix(); \ >>> | ^~~~~~ >>> include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert' >>> 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) >>> | ^~~~~~~~~~~~~~~~~~~ >>> include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert' >>> 39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) >>> | ^~~~~~~~~~~~~~~~~~ >>> include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG' >>> 59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed") >>> | ^~~~~~~~~~~~~~~~ >>> arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG' >>> 70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; }) >>> | ^~~~~~~~~ >>> mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER' >>> 66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER >>> | ^~~~~~~~~~~~~~~~~~ >>> mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER' >>> 1898 | extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); >>> | ^~~~~~~~~~~~~~~~~~~~~~ >>> In function 'fragmentation_score_zone', >>> inlined from 'kcompactd' at mm/compaction.c:1918:12: >>> include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed >>> 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) >>> | ^ >>> include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert' >>> 320 | prefix ## suffix(); \ >>> | ^~~~~~ >>> include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert' >>> 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) >>> | ^~~~~~~~~~~~~~~~~~~ >>> include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert' >>> 39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) >>> | ^~~~~~~~~~~~~~~~~~ >>> include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG' >>> 59 | 
#define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed") >>> | ^~~~~~~~~~~~~~~~ >>> arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG' >>> 70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; }) >>> | ^~~~~~~~~ >>> mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER' >>> 66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER >>> | ^~~~~~~~~~~~~~~~~~ >>> mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER' >>> 1898 | extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); >>> | ^~~~~~~~~~~~~~~~~~~~~~ >>> In function 'fragmentation_score_zone', >>> inlined from 'kcompactd' at mm/compaction.c:1918:12: >>> include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed >>> 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) >>> | ^ >>> include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert' >>> 320 | prefix ## suffix(); \ >>> | ^~~~~~ >>> include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert' >>> 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) >>> | ^~~~~~~~~~~~~~~~~~~ >>> include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert' >>> 39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) >>> | ^~~~~~~~~~~~~~~~~~ >>> include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG' >>> 59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed") >>> | ^~~~~~~~~~~~~~~~ >>> arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG' >>> 70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; }) >>> | ^~~~~~~~~ >>> mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER' >>> 66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER >>> | ^~~~~~~~~~~~~~~~~~ >>> mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER' >>> 1898 | 
extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); >>> | ^~~~~~~~~~~~~~~~~~~~~~ >>> In function 'fragmentation_score_zone', >>> inlined from 'kcompactd' at mm/compaction.c:1918:12: >>> include/linux/compiler.h:339:38: error: call to '__compiletime_assert_301' declared with attribute error: BUILD_BUG failed >>> 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) >>> | ^ >>> include/linux/compiler.h:320:4: note: in definition of macro '__compiletime_assert' >>> 320 | prefix ## suffix(); \ >>> | ^~~~~~ >>> include/linux/compiler.h:339:2: note: in expansion of macro '_compiletime_assert' >>> 339 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) >>> | ^~~~~~~~~~~~~~~~~~~ >>> include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert' >>> 39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) >>> | ^~~~~~~~~~~~~~~~~~ >>> include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG' >>> 59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed") >>> | ^~~~~~~~~~~~~~~~ >>> arch/mips/include/asm/page.h:70:30: note: in expansion of macro 'BUILD_BUG' >>> 70 | #define HUGETLB_PAGE_ORDER ({BUILD_BUG(); 0; }) >>> | ^~~~~~~~~ >>> mm/compaction.c:66:32: note: in expansion of macro 'HUGETLB_PAGE_ORDER' >>> 66 | #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER >>> | ^~~~~~~~~~~~~~~~~~ >>> mm/compaction.c:1898:28: note: in expansion of macro 'COMPACTION_HPAGE_ORDER' >>> 1898 | extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); >>> | ^~~~~~~~~~~~~~~~~~~~~~ >>> make[3]: *** [scripts/Makefile.build:281: mm/compaction.o] Error 1 >>> make[3]: Target '__build' not remade because of errors. >>> make[2]: *** [Makefile:1765: mm] Error 2 >>> make[2]: Target 'mm/compaction.o' not remade because of errors. >>> make[1]: *** [Makefile:336: __build_one_by_one] Error 2 >>> make[1]: Target 'distclean' not remade because of errors. 
>>> make[1]: Target 'malta_kvm_guest_defconfig' not remade because of errors. >>> make[1]: Target 'mm/compaction.o' not remade because of errors. >>> make: *** [Makefile:185: __sub-make] Error 2 >>> make: Target 'distclean' not remade because of errors. >>> make: Target 'malta_kvm_guest_defconfig' not remade because of errors. >>> make: Target 'mm/compaction.o' not remade because of errors. >>> >>> I am not sure why MIPS is special with its handling of hugepage support >>> but I am far from a MIPS expert :) >>> >> >> Can you check if this patch fixes the compile error: >> >> >> diff --git a/mm/compaction.c b/mm/compaction.c >> index 45fd24a0ea0b..02963ffb9e70 100644 >> --- a/mm/compaction.c >> +++ b/mm/compaction.c >> @@ -62,7 +62,7 @@ static const unsigned int >> HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500; >> */ >> #if defined CONFIG_TRANSPARENT_HUGEPAGE >> #define COMPACTION_HPAGE_ORDER HPAGE_PMD_ORDER >> -#elif defined HUGETLB_PAGE_ORDER >> +#elif defined CONFIG_HUGETLBFS >> #define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER >> #else >> #define COMPACTION_HPAGE_ORDER (PMD_SHIFT - PAGE_SHIFT) >> >> >> >> > > Tested-by: Nathan Chancellor <natechancellor@gmail.com> # build > Thanks. I will send out a patch with this fix soon. Nitin
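For readers following the thread, the threshold translation and the back-off interval described in the quoted commit message can be sketched in user space roughly as below. This is only an illustrative sketch, not the kernel code: all sketch_* names are made up here, and the value 6 for the defer shift is an assumption taken from the kernel's COMPACT_MAX_DEFER_SHIFT definition, which this thread does not state.

```c
#include <assert.h>

/* Assumed value of the kernel's COMPACT_MAX_DEFER_SHIFT; not stated in this thread. */
#define SKETCH_MAX_DEFER_SHIFT 6
/* Check interval used by the patch (HPAGE_FRAG_CHECK_INTERVAL_MSEC). */
#define SKETCH_CHECK_INTERVAL_MSEC 500

/* Low threshold: 100 - proactiveness, but never below 5 (the cap in the patch). */
int sketch_wmark_low(int proactiveness)
{
	int low = 100 - proactiveness;

	return low > 5 ? low : 5;
}

/* High threshold: low + 10, but never above 100. */
int sketch_wmark_high(int proactiveness)
{
	int high = sketch_wmark_low(proactiveness) + 10;

	return high < 100 ? high : 100;
}

/* Time skipped after a proactive round that did not lower the score. */
unsigned int sketch_defer_msec(void)
{
	return (1u << SKETCH_MAX_DEFER_SHIFT) * SKETCH_CHECK_INTERVAL_MSEC;
}

/* Checks the numbers quoted in the commit message for proactiveness=20. */
void sketch_selftest(void)
{
	assert(sketch_wmark_low(20) == 80);
	assert(sketch_wmark_high(20) == 90);
}
```

With the default proactiveness of 20 this gives low=80 and high=90, matching the commit message. Under the assumed shift of 6, a round that makes no progress is followed by 64 skipped checks of 500 msec each, i.e. 32 seconds, which is in line with the "~30 seconds between retries" noted above.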
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index d46d5b7013c6..4b7c496199ca 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -119,6 +119,21 @@ all zones are compacted such that free memory is available in contiguous blocks where possible. This can be important for example in the allocation of huge pages although processes will also directly compact memory as required. +compaction_proactiveness +======================== + +This tunable takes a value in the range [0, 100] with a default value of +20. This tunable determines how aggressively compaction is done in the +background. Setting it to 0 disables proactive compaction. + +Note that compaction has a non-trivial system-wide impact as pages +belonging to different processes are moved around, which could also lead +to latency spikes in unsuspecting applications. The kernel employs +various heuristics to avoid wasting CPU cycles if it detects that +proactive compaction is not being effective. + +Be careful when setting it to extreme values like 100, as that may +cause excessive background compaction activity. 
compact_unevictable_allowed =========================== diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 6fa0eea3f530..7a242d46454e 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -85,11 +85,13 @@ static inline unsigned long compact_gap(unsigned int order) #ifdef CONFIG_COMPACTION extern int sysctl_compact_memory; +extern int sysctl_compaction_proactiveness; extern int sysctl_compaction_handler(struct ctl_table *table, int write, void *buffer, size_t *length, loff_t *ppos); extern int sysctl_extfrag_threshold; extern int sysctl_compact_unevictable_allowed; +extern int extfrag_for_order(struct zone *zone, unsigned int order); extern int fragmentation_index(struct zone *zone, unsigned int order); extern enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, unsigned int alloc_flags, diff --git a/kernel/sysctl.c b/kernel/sysctl.c index db1ce7af2563..58b0a59c9769 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -2830,6 +2830,15 @@ static struct ctl_table vm_table[] = { .mode = 0200, .proc_handler = sysctl_compaction_handler, }, + { + .procname = "compaction_proactiveness", + .data = &sysctl_compaction_proactiveness, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = SYSCTL_ZERO, + .extra2 = &one_hundred, + }, { .procname = "extfrag_threshold", .data = &sysctl_extfrag_threshold, diff --git a/mm/compaction.c b/mm/compaction.c index fd988b7e5f2b..ac2030814edb 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -50,6 +50,24 @@ static inline void count_compact_events(enum vm_event_item item, long delta) #define pageblock_start_pfn(pfn) block_start_pfn(pfn, pageblock_order) #define pageblock_end_pfn(pfn) block_end_pfn(pfn, pageblock_order) +/* + * Fragmentation score check interval for proactive compaction purposes. 
+ */ +static const int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500; + +/* + * Page order with-respect-to which proactive compaction + * calculates external fragmentation, which is used as + * the "fragmentation score" of a node/zone. + */ +#if defined CONFIG_TRANSPARENT_HUGEPAGE +#define COMPACTION_HPAGE_ORDER HPAGE_PMD_ORDER +#elif defined HUGETLB_PAGE_ORDER +#define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER +#else +#define COMPACTION_HPAGE_ORDER (PMD_SHIFT - PAGE_SHIFT) +#endif + static unsigned long release_freepages(struct list_head *freelist) { struct page *page, *next; @@ -1857,6 +1875,76 @@ static inline bool is_via_compact_memory(int order) return order == -1; } +static bool kswapd_is_running(pg_data_t *pgdat) +{ + return pgdat->kswapd && (pgdat->kswapd->state == TASK_RUNNING); +} + +/* + * A zone's fragmentation score is the external fragmentation wrt to the + * COMPACTION_HPAGE_ORDER scaled by the zone's size. It returns a value + * in the range [0, 100]. + * + * The scaling factor ensures that proactive compaction focuses on larger + * zones like ZONE_NORMAL, rather than smaller, specialized zones like + * ZONE_DMA32. For smaller zones, the score value remains close to zero, + * and thus never exceeds the high threshold for proactive compaction. + */ +static int fragmentation_score_zone(struct zone *zone) +{ + unsigned long score; + + score = zone->present_pages * + extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); + return div64_ul(score, zone->zone_pgdat->node_present_pages + 1); +} + +/* + * The per-node proactive (background) compaction process is started by its + * corresponding kcompactd thread when the node's fragmentation score + * exceeds the high threshold. The compaction process remains active till + * the node's score falls below the low threshold, or one of the back-off + * conditions is met. 
+ */ +static int fragmentation_score_node(pg_data_t *pgdat) +{ + unsigned long score = 0; + int zoneid; + + for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) { + struct zone *zone; + + zone = &pgdat->node_zones[zoneid]; + score += fragmentation_score_zone(zone); + } + + return score; +} + +static int fragmentation_score_wmark(pg_data_t *pgdat, bool low) +{ + int wmark_low; + + /* + * Cap the low watermak to avoid excessive compaction + * activity in case a user sets the proactivess tunable + * close to 100 (maximum). + */ + wmark_low = max(100 - sysctl_compaction_proactiveness, 5); + return low ? wmark_low : min(wmark_low + 10, 100); +} + +static bool should_proactive_compact_node(pg_data_t *pgdat) +{ + int wmark_high; + + if (!sysctl_compaction_proactiveness || kswapd_is_running(pgdat)) + return false; + + wmark_high = fragmentation_score_wmark(pgdat, false); + return fragmentation_score_node(pgdat) > wmark_high; +} + static enum compact_result __compact_finished(struct compact_control *cc) { unsigned int order; @@ -1883,6 +1971,25 @@ static enum compact_result __compact_finished(struct compact_control *cc) return COMPACT_PARTIAL_SKIPPED; } + if (cc->proactive_compaction) { + int score, wmark_low; + pg_data_t *pgdat; + + pgdat = cc->zone->zone_pgdat; + if (kswapd_is_running(pgdat)) + return COMPACT_PARTIAL_SKIPPED; + + score = fragmentation_score_zone(cc->zone); + wmark_low = fragmentation_score_wmark(pgdat, true); + + if (score > wmark_low) + ret = COMPACT_CONTINUE; + else + ret = COMPACT_SUCCESS; + + goto out; + } + if (is_via_compact_memory(cc->order)) return COMPACT_CONTINUE; @@ -1941,6 +2048,7 @@ static enum compact_result __compact_finished(struct compact_control *cc) } } +out: if (cc->contended || fatal_signal_pending(current)) ret = COMPACT_CONTENDED; @@ -2410,6 +2518,41 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, return rc; } +/* + * Compact all zones within a node till each zone's fragmentation score + * reaches within 
proactive compaction thresholds (as determined by the + * proactiveness tunable). + * + * It is possible that the function returns before reaching score targets + * due to various back-off conditions, such as, contention on per-node or + * per-zone locks. + */ +static void proactive_compact_node(pg_data_t *pgdat) +{ + int zoneid; + struct zone *zone; + struct compact_control cc = { + .order = -1, + .mode = MIGRATE_SYNC_LIGHT, + .ignore_skip_hint = true, + .whole_zone = true, + .gfp_mask = GFP_KERNEL, + .proactive_compaction = true, + }; + + for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) { + zone = &pgdat->node_zones[zoneid]; + if (!populated_zone(zone)) + continue; + + cc.zone = zone; + + compact_zone(&cc, NULL); + + VM_BUG_ON(!list_empty(&cc.freepages)); + VM_BUG_ON(!list_empty(&cc.migratepages)); + } +} /* Compact all zones within a node */ static void compact_node(int nid) @@ -2456,6 +2599,13 @@ static void compact_nodes(void) /* The written value is actually unused, all memory is compacted */ int sysctl_compact_memory; +/* + * Tunable for proactive compaction. It determines how + * aggressively the kernel should compact memory in the + * background. It takes values in the range [0, 100]. 
+ */ +int __read_mostly sysctl_compaction_proactiveness = 20; + /* * This is the entry point for compacting all nodes via * /proc/sys/vm/compact_memory @@ -2635,6 +2785,7 @@ static int kcompactd(void *p) { pg_data_t *pgdat = (pg_data_t*)p; struct task_struct *tsk = current; + unsigned int proactive_defer = 0; const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id); @@ -2650,12 +2801,34 @@ static int kcompactd(void *p) unsigned long pflags; trace_mm_compaction_kcompactd_sleep(pgdat->node_id); - wait_event_freezable(pgdat->kcompactd_wait, - kcompactd_work_requested(pgdat)); + if (wait_event_freezable_timeout(pgdat->kcompactd_wait, + kcompactd_work_requested(pgdat), + msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC))) { + + psi_memstall_enter(&pflags); + kcompactd_do_work(pgdat); + psi_memstall_leave(&pflags); + continue; + } - psi_memstall_enter(&pflags); - kcompactd_do_work(pgdat); - psi_memstall_leave(&pflags); + /* kcompactd wait timeout */ + if (should_proactive_compact_node(pgdat)) { + unsigned int prev_score, score; + + if (proactive_defer) { + proactive_defer--; + continue; + } + prev_score = fragmentation_score_node(pgdat); + proactive_compact_node(pgdat); + score = fragmentation_score_node(pgdat); + /* + * Defer proactive compaction if the fragmentation + * score did not go down i.e. no progress made. + */ + proactive_defer = score < prev_score ? + 0 : 1 << COMPACT_MAX_DEFER_SHIFT; + } } return 0; diff --git a/mm/internal.h b/mm/internal.h index 9886db20d94f..42cf0b610847 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -239,6 +239,7 @@ struct compact_control { bool no_set_skip_hint; /* Don't mark blocks for skipping */ bool ignore_block_suitable; /* Scan blocks considered unsuitable */ bool direct_compaction; /* False from kcompactd or /proc/... 
*/ + bool proactive_compaction; /* kcompactd proactive compaction */ bool whole_zone; /* Whole zone should/has been scanned */ bool contended; /* Signal lock or sched contention */ bool rescan; /* Rescanning the same pageblock */ diff --git a/mm/vmstat.c b/mm/vmstat.c index 3fb23a21f6dd..3e7ba8bce2ba 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1074,6 +1074,24 @@ static int __fragmentation_index(unsigned int order, struct contig_page_info *in return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total); } +/* + * Calculates external fragmentation within a zone wrt the given order. + * It is defined as the percentage of pages found in blocks of size + * less than 1 << order. It returns values in range [0, 100]. + */ +int extfrag_for_order(struct zone *zone, unsigned int order) +{ + struct contig_page_info info; + + fill_contig_page_info(zone, order, &info); + if (info.free_pages == 0) + return 0; + + return div_u64((info.free_pages - + (info.free_blocks_suitable << order)) * 100, + info.free_pages); +} + /* Same as __fragmentation index but allocs contig_page_info on stack */ int fragmentation_index(struct zone *zone, unsigned int order) {
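As a rough illustration of the arithmetic in extfrag_for_order() and fragmentation_score_zone() above, here is a user-space sketch. It is not the kernel implementation: struct sketch_page_info is a hypothetical stand-in for the two contig_page_info fields the patch reads, plain 64-bit division replaces div_u64()/div64_ul(), and the sketch_* names exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the contig_page_info fields used by the patch. */
struct sketch_page_info {
	uint64_t free_pages;           /* all free pages in the zone */
	uint64_t free_blocks_suitable; /* free chunks of size 1 << order */
};

/*
 * Percentage of free pages sitting in blocks smaller than 1 << order.
 * Returns a value in [0, 100]; an empty zone scores 0.
 */
unsigned int sketch_extfrag(const struct sketch_page_info *info,
			    unsigned int order)
{
	uint64_t suitable_pages;

	if (info->free_pages == 0)
		return 0;

	suitable_pages = info->free_blocks_suitable << order;
	return (unsigned int)(((info->free_pages - suitable_pages) * 100) /
			      info->free_pages);
}

/*
 * Zone score: external fragmentation weighted by the zone's share of the
 * node's present pages. The +1 mirrors the patch and avoids dividing by 0.
 */
unsigned int sketch_score_zone(uint64_t zone_present, uint64_t node_present,
			       unsigned int extfrag)
{
	return (unsigned int)((zone_present * extfrag) / (node_present + 1));
}

/* Small worked example: half of the free pages are in order-9 blocks. */
void sketch_extfrag_selftest(void)
{
	struct sketch_page_info info = { 1024, 1 };

	assert(sketch_extfrag(&info, 9) == 50);
	assert(sketch_score_zone(1000, 1999, 50) == 25);
}
```

In the worked example, 512 of 1024 free pages belong to one order-9 block, so external fragmentation is 50. A zone holding 1000 of a node's 1999 present pages then contributes 1000 * 50 / 2000 = 25 to the node score, which shows how a small zone's contribution stays low even when it is badly fragmented.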