Message ID | 20240415081220.3246839-1-wangkefeng.wang@huawei.com (mailing list archive) |
---|---|
Series | mm: allow more high-order pages stored on PCP lists |
On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
> Both the file pages and anonymous pages support large folio, high-order
> pages except PMD_ORDER will also be allocated frequently which could
> increase the zone lock contention, allow high-order pages on pcp lists
> could reduce the big zone lock contention, but as commit 44042b449872
> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
> pointed, it may not win in all the scenes, add a new control sysfs to
> enable or disable specified high-order pages stored on PCP lists, the order
> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP list by default.

This is precisely something Baolin and I have discussed and intended
to implement[1], but unfortunately, we haven't had the time to do so.

[1] https://lore.kernel.org/linux-mm/13c59ca8-baac-405e-8640-e693c78ef79a@suse.cz/T/#mecb0514ced830ac4df320113bedd7073bea9ab7a

>
> With perf lock tools, the lock contention from will-it-scale page_fault1
> (with 90 tasks run 10s, hugepage-2048KB never, hugepage-64K always) show
> below (only care about zone spinlock and pcp spinlock),
>
> Without patches,
>   contended   total wait   max wait   avg wait       type   caller
>         713      4.64 ms   74.37 us    6.51 us   spinlock   __alloc_pages+0x23c
>
> With patches,
>   contended   total wait   max wait   avg wait       type   caller
>           2     25.66 us   16.31 us   12.83 us   spinlock   rmqueue_pcplist+0x2b0
>
> Similar results on shell8 from unixbench,
>
> Without patches,
>        4942    901.09 ms    1.31 ms  182.33 us   spinlock   __alloc_pages+0x23c
>        1556    298.76 ms    1.23 ms  192.01 us   spinlock   rmqueue_pcplist+0x2b0
>         991    182.73 ms  879.80 us  184.39 us   spinlock   rmqueue_pcplist+0x2b0
>
> With patches,
>   contended   total wait   max wait   avg wait       type   caller
>         988    187.63 ms  855.18 us  189.91 us   spinlock   rmqueue_pcplist+0x2b0
>         505     88.99 ms  793.27 us  176.21 us   spinlock   rmqueue_pcplist+0x2b0
>
> The Benchmarks Score shows a little improvement (0.28%) from shell8, but the
> zone lock from __alloc_pages() disappeared.
>
> Kefeng Wang (3):
>   mm: prepare more high-order pages to be stored on the per-cpu lists
>   mm: add control to allow specified high-order pages stored on PCP list
>   mm: pcp: show each order page count
>
>  Documentation/admin-guide/mm/transhuge.rst | 11 ++++
>  include/linux/gfp.h                        |  1 +
>  include/linux/huge_mm.h                    |  1 +
>  include/linux/mmzone.h                     | 10 ++-
>  include/linux/vmstat.h                     | 19 ++++++
>  mm/Kconfig.debug                           |  8 +++
>  mm/huge_memory.c                           | 74 ++++++++++++++++++++++
>  mm/page_alloc.c                            | 30 +++++++--
>  mm/vmstat.c                                | 16 +++++
>  9 files changed, 164 insertions(+), 6 deletions(-)
>
> --
> 2.27.0
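The patches themselves are not quoted in this thread, but the mechanism the cover
letter describes can be sketched roughly as follows, building on the
pcp_allowed_order() check introduced by commit 44042b449872. This is only an
illustrative sketch, not the actual series: the pcp_allowed_orders bitmask and the
helper wired to the per-size pcp_enabled sysfs file are assumed names.

	/*
	 * Illustrative sketch (assumed names, not the real patch): track the
	 * orders allowed on the PCP lists in a bitmask.  Orders up to
	 * PAGE_ALLOC_COSTLY_ORDER and PMD_ORDER stay enabled as before; the
	 * orders in between default to off and are toggled from sysfs.
	 */
	static unsigned long pcp_allowed_orders =
		(BIT(PAGE_ALLOC_COSTLY_ORDER + 1) - 1) | BIT(PMD_ORDER);

	static inline bool pcp_allowed_order(unsigned int order)
	{
		return order <= PMD_ORDER &&
		       test_bit(order, &pcp_allowed_orders);
	}

	/*
	 * Called when a per-size
	 * /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/pcp_enabled
	 * file is written.
	 */
	static void pcp_order_set_enabled(unsigned int order, bool enabled)
	{
		if (order <= PAGE_ALLOC_COSTLY_ORDER || order > PMD_ORDER)
			return;
		if (enabled)
			set_bit(order, &pcp_allowed_orders);
		else
			clear_bit(order, &pcp_allowed_orders);
	}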
On 2024/4/15 16:18, Barry Song wrote: > On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote: >> >> Both the file pages and anonymous pages support large folio, high-order >> pages except PMD_ORDER will also be allocated frequently which could >> increase the zone lock contention, allow high-order pages on pcp lists >> could reduce the big zone lock contention, but as commit 44042b449872 >> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists") >> pointed, it may not win in all the scenes, add a new control sysfs to >> enable or disable specified high-order pages stored on PCP lists, the order >> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP list by default. > > This is precisely something Baolin and I have discussed and intended > to implement[1], > but unfortunately, we haven't had the time to do so. Indeed, same thing. Recently, we are working on unixbench/lmbench optimization, I tested Multi-size THP for anonymous memory by hard-cord PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1], it shows some improvement but not for all cases and not very stable, so re-implemented it by according to the user requirement and enable it dynamically. [1] https://lore.kernel.org/linux-mm/b8f5a47a-af1e-44ed-a89b-460d0be56d2c@huawei.com/ > > [1] https://lore.kernel.org/linux-mm/13c59ca8-baac-405e-8640-e693c78ef79a@suse.cz/T/#mecb0514ced830ac4df320113bedd7073bea9ab7a > >> >> With perf lock tools, the lock contention from will-it-scale page_fault1 >> (with 90 tasks run 10s, hugepage-2048KB never, hugepage-64K always) show >> below(only care about zone spinlock and pcp spinlock), >> >> Without patches, >> contended total wait max wait avg wait type caller >> 713 4.64 ms 74.37 us 6.51 us spinlock __alloc_pages+0x23c >> >> With patches, >> contended total wait max wait avg wait type caller >> 2 25.66 us 16.31 us 12.83 us spinlock rmqueue_pcplist+0x2b0 >> >> Similar results on shell8 from unixbench, >> >> Without patches, >> 4942 901.09 ms 1.31 ms 182.33 us spinlock __alloc_pages+0x23c >> 1556 298.76 ms 1.23 ms 192.01 us spinlock rmqueue_pcplist+0x2b0 >> 991 182.73 ms 879.80 us 184.39 us spinlock rmqueue_pcplist+0x2b0 >> >> With patches, >> contended total wait max wait avg wait type caller >> 988 187.63 ms 855.18 us 189.91 us spinlock rmqueue_pcplist+0x2b0 >> 505 88.99 ms 793.27 us 176.21 us spinlock rmqueue_pcplist+0x2b0 >> >> The Benchmarks Score shows a little improvoment(0.28%) from shell8, but the >> zone lock from __alloc_pages() disappeared. >> >> Kefeng Wang (3): >> mm: prepare more high-order pages to be stored on the per-cpu lists >> mm: add control to allow specified high-order pages stored on PCP list >> mm: pcp: show each order page count >> >> Documentation/admin-guide/mm/transhuge.rst | 11 ++++ >> include/linux/gfp.h | 1 + >> include/linux/huge_mm.h | 1 + >> include/linux/mmzone.h | 10 ++- >> include/linux/vmstat.h | 19 ++++++ >> mm/Kconfig.debug | 8 +++ >> mm/huge_memory.c | 74 ++++++++++++++++++++++ >> mm/page_alloc.c | 30 +++++++-- >> mm/vmstat.c | 16 +++++ >> 9 files changed, 164 insertions(+), 6 deletions(-) >> >> -- >> 2.27.0 >> >> >
On 15.04.24 10:59, Kefeng Wang wrote: > > > On 2024/4/15 16:18, Barry Song wrote: >> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote: >>> >>> Both the file pages and anonymous pages support large folio, high-order >>> pages except PMD_ORDER will also be allocated frequently which could >>> increase the zone lock contention, allow high-order pages on pcp lists >>> could reduce the big zone lock contention, but as commit 44042b449872 >>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists") >>> pointed, it may not win in all the scenes, add a new control sysfs to >>> enable or disable specified high-order pages stored on PCP lists, the order >>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP list by default. >> >> This is precisely something Baolin and I have discussed and intended >> to implement[1], >> but unfortunately, we haven't had the time to do so. > > Indeed, same thing. Recently, we are working on unixbench/lmbench > optimization, I tested Multi-size THP for anonymous memory by hard-cord > PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1], it shows some improvement but > not for all cases and not very stable, so re-implemented it by according > to the user requirement and enable it dynamically. I'm wondering, though, if this is really a suitable candidate for a sysctl toggle. Can anybody really come up with an educated guess for these values? Especially reading "Benchmarks Score shows a little improvoment(0.28%)" and "it may not win in all the scenes", to me it mostly sounds like "minimal impact" -- so who cares? How much is the cost vs. benefit of just having one sane system configuration?
On Mon, Apr 15, 2024 at 6:52 PM David Hildenbrand <david@redhat.com> wrote: > > On 15.04.24 10:59, Kefeng Wang wrote: > > > > > > On 2024/4/15 16:18, Barry Song wrote: > >> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote: > >>> > >>> Both the file pages and anonymous pages support large folio, high-order > >>> pages except PMD_ORDER will also be allocated frequently which could > >>> increase the zone lock contention, allow high-order pages on pcp lists > >>> could reduce the big zone lock contention, but as commit 44042b449872 > >>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists") > >>> pointed, it may not win in all the scenes, add a new control sysfs to > >>> enable or disable specified high-order pages stored on PCP lists, the order > >>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP list by default. > >> > >> This is precisely something Baolin and I have discussed and intended > >> to implement[1], > >> but unfortunately, we haven't had the time to do so. > > > > Indeed, same thing. Recently, we are working on unixbench/lmbench > > optimization, I tested Multi-size THP for anonymous memory by hard-cord > > PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1], it shows some improvement but > > not for all cases and not very stable, so re-implemented it by according > > to the user requirement and enable it dynamically. > > I'm wondering, though, if this is really a suitable candidate for a > sysctl toggle. Can anybody really come up with an educated guess for > these values? > > Especially reading "Benchmarks Score shows a little improvoment(0.28%)" > and "it may not win in all the scenes", to me it mostly sounds like > "minimal impact" -- so who cares? Considering the original goal of employing PCP to alleviate page allocation lock contention, and now that we have configured mTHP, for instance, to 64KiB, it's possible that 64KiB could become the most common page allocation size just like order0. We should expect to see similar improvements as a result. I'm questioning whether shell8 is the suitable benchmark for this situation. A mere 0.28% performance enhancement might not be substantial to pique interest. Shouldn't we have numerous threads allocating and freeing in parallel to truly gauge the benefits of PCP? > > How much is the cost vs. benefit of just having one sane system > configuration? > > -- > Cheers, > > David / dhildenb >
On 2024/4/15 18:52, David Hildenbrand wrote: > On 15.04.24 10:59, Kefeng Wang wrote: >> >> >> On 2024/4/15 16:18, Barry Song wrote: >>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang >>> <wangkefeng.wang@huawei.com> wrote: >>>> >>>> Both the file pages and anonymous pages support large folio, high-order >>>> pages except PMD_ORDER will also be allocated frequently which could >>>> increase the zone lock contention, allow high-order pages on pcp lists >>>> could reduce the big zone lock contention, but as commit 44042b449872 >>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu >>>> lists") >>>> pointed, it may not win in all the scenes, add a new control sysfs to >>>> enable or disable specified high-order pages stored on PCP lists, >>>> the order >>>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP list by >>>> default. >>> >>> This is precisely something Baolin and I have discussed and intended >>> to implement[1], >>> but unfortunately, we haven't had the time to do so. >> >> Indeed, same thing. Recently, we are working on unixbench/lmbench >> optimization, I tested Multi-size THP for anonymous memory by hard-cord >> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1], it shows some improvement but >> not for all cases and not very stable, so re-implemented it by according >> to the user requirement and enable it dynamically. > > I'm wondering, though, if this is really a suitable candidate for a > sysctl toggle. Can anybody really come up with an educated guess for > these values? Not sure this is suitable in sysctl, but mTHP anon is enabled in sysctl, we could trace __alloc_pages() and do order statistic to decide to choose the high-order to be enabled on PCP. > > Especially reading "Benchmarks Score shows a little improvoment(0.28%)" > and "it may not win in all the scenes", to me it mostly sounds like > "minimal impact" -- so who cares? Even though lock conflicts are eliminated, there is very limited performance improvement(even maybe fluctuation), it is not a good testcase to show improvement, just show the zone-lock issue, we need to find other better testcase, maybe some test on Andriod(heavy use 64K, no PMD THP), or LKP maybe give some help? I will try to find other testcase to show the benefit. > > How much is the cost vs. benefit of just having one sane system > configuration? > For arm64 with 4k, five more high-orders(4~8), five more pcplists, and for high-orders, we assumes most of them are moveable, but maybe not, so enable it by default maybe more fragmentization, see 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized allocations").
On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote: > > > > On 2024/4/15 18:52, David Hildenbrand wrote: > > On 15.04.24 10:59, Kefeng Wang wrote: > >> > >> > >> On 2024/4/15 16:18, Barry Song wrote: > >>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang > >>> <wangkefeng.wang@huawei.com> wrote: > >>>> > >>>> Both the file pages and anonymous pages support large folio, high-order > >>>> pages except PMD_ORDER will also be allocated frequently which could > >>>> increase the zone lock contention, allow high-order pages on pcp lists > >>>> could reduce the big zone lock contention, but as commit 44042b449872 > >>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu > >>>> lists") > >>>> pointed, it may not win in all the scenes, add a new control sysfs to > >>>> enable or disable specified high-order pages stored on PCP lists, > >>>> the order > >>>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP list by > >>>> default. > >>> > >>> This is precisely something Baolin and I have discussed and intended > >>> to implement[1], > >>> but unfortunately, we haven't had the time to do so. > >> > >> Indeed, same thing. Recently, we are working on unixbench/lmbench > >> optimization, I tested Multi-size THP for anonymous memory by hard-cord > >> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1], it shows some improvement but > >> not for all cases and not very stable, so re-implemented it by according > >> to the user requirement and enable it dynamically. > > > > I'm wondering, though, if this is really a suitable candidate for a > > sysctl toggle. Can anybody really come up with an educated guess for > > these values? > > Not sure this is suitable in sysctl, but mTHP anon is enabled in sysctl, > we could trace __alloc_pages() and do order statistic to decide to > choose the high-order to be enabled on PCP. > > > > > Especially reading "Benchmarks Score shows a little improvoment(0.28%)" > > and "it may not win in all the scenes", to me it mostly sounds like > > "minimal impact" -- so who cares? > > Even though lock conflicts are eliminated, there is very limited > performance improvement(even maybe fluctuation), it is not a good > testcase to show improvement, just show the zone-lock issue, we need to > find other better testcase, maybe some test on Andriod(heavy use 64K, no > PMD THP), or LKP maybe give some help? > > I will try to find other testcase to show the benefit. Hi Kefeng, I wonder if you will see some major improvements on mTHP 64KiB using the below microbench I wrote just now, for example perf and time to finish the program #define DATA_SIZE (2UL * 1024 * 1024) int main(int argc, char **argv) { /* make 32 concurrent alloc and free of mTHP */ fork(); fork(); fork(); fork(); fork(); for (int i = 0; i < 100000; i++) { void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); if (addr == MAP_FAILED) { perror("fail to malloc"); return -1; } memset(addr, 0x11, DATA_SIZE); munmap(addr, DATA_SIZE); } return 0; } > > > > > How much is the cost vs. benefit of just having one sane system > > configuration? > > > > For arm64 with 4k, five more high-orders(4~8), five more pcplists, > and for high-orders, we assumes most of them are moveable, but maybe > not, so enable it by default maybe more fragmentization, see > 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized > allocations"). >
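For reference, the program above only needs the usual headers to build; a
self-contained copy (the only additions over the snippet quoted above are the
includes and a comment) looks like:

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define DATA_SIZE (2UL * 1024 * 1024)

	int main(int argc, char **argv)
	{
		/* make 32 concurrent alloc and free of mTHP */
		fork(); fork(); fork(); fork(); fork();

		for (int i = 0; i < 100000; i++) {
			void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
					  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
			if (addr == MAP_FAILED) {
				perror("fail to malloc");
				return -1;
			}
			/* touch every page so the mTHP is actually allocated */
			memset(addr, 0x11, DATA_SIZE);
			munmap(addr, DATA_SIZE);
		}

		return 0;
	}

Building it with gcc and timing the run is enough to compare the pcp_enabled
settings, which is what the measurements below do.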
On 2024/4/16 8:21, Barry Song wrote: > On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote: >> >> >> >> On 2024/4/15 18:52, David Hildenbrand wrote: >>> On 15.04.24 10:59, Kefeng Wang wrote: >>>> >>>> >>>> On 2024/4/15 16:18, Barry Song wrote: >>>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang >>>>> <wangkefeng.wang@huawei.com> wrote: >>>>>> >>>>>> Both the file pages and anonymous pages support large folio, high-order >>>>>> pages except PMD_ORDER will also be allocated frequently which could >>>>>> increase the zone lock contention, allow high-order pages on pcp lists >>>>>> could reduce the big zone lock contention, but as commit 44042b449872 >>>>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu >>>>>> lists") >>>>>> pointed, it may not win in all the scenes, add a new control sysfs to >>>>>> enable or disable specified high-order pages stored on PCP lists, >>>>>> the order >>>>>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP list by >>>>>> default. >>>>> >>>>> This is precisely something Baolin and I have discussed and intended >>>>> to implement[1], >>>>> but unfortunately, we haven't had the time to do so. >>>> >>>> Indeed, same thing. Recently, we are working on unixbench/lmbench >>>> optimization, I tested Multi-size THP for anonymous memory by hard-cord >>>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1], it shows some improvement but >>>> not for all cases and not very stable, so re-implemented it by according >>>> to the user requirement and enable it dynamically. >>> >>> I'm wondering, though, if this is really a suitable candidate for a >>> sysctl toggle. Can anybody really come up with an educated guess for >>> these values? >> >> Not sure this is suitable in sysctl, but mTHP anon is enabled in sysctl, >> we could trace __alloc_pages() and do order statistic to decide to >> choose the high-order to be enabled on PCP. >> >>> >>> Especially reading "Benchmarks Score shows a little improvoment(0.28%)" >>> and "it may not win in all the scenes", to me it mostly sounds like >>> "minimal impact" -- so who cares? >> >> Even though lock conflicts are eliminated, there is very limited >> performance improvement(even maybe fluctuation), it is not a good >> testcase to show improvement, just show the zone-lock issue, we need to >> find other better testcase, maybe some test on Andriod(heavy use 64K, no >> PMD THP), or LKP maybe give some help? >> >> I will try to find other testcase to show the benefit. 
>
> Hi Kefeng,
>
> I wonder if you will see some major improvements on mTHP 64KiB using
> the below microbench I wrote just now, for example perf and time to
> finish the program
>
> #define DATA_SIZE (2UL * 1024 * 1024)
>
> int main(int argc, char **argv)
> {
>         /* make 32 concurrent alloc and free of mTHP */
>         fork(); fork(); fork(); fork(); fork();
>
>         for (int i = 0; i < 100000; i++) {
>                 void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
>                                 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>                 if (addr == MAP_FAILED) {
>                         perror("fail to malloc");
>                         return -1;
>                 }
>                 memset(addr, 0x11, DATA_SIZE);
>                 munmap(addr, DATA_SIZE);
>         }
>
>         return 0;
> }
>

1) PCP disabled
          1        2        3        4        5      average
real   200.41   202.18   203.16   201.54   200.91   201.64
user     6.49     6.21     6.25     6.31     6.35     6.322
sys    193.3    195.39   196.3    194.65   194.01   194.73

2) PCP enabled
real   198.25   199.26   195.51   199.28   189.12   196.284   -2.66%
user     6.21     6.02     6.02     6.28     6.21     6.148    -2.75%
sys    191.46   192.64   188.96   192.47   182.39   189.584   -2.64%

For the above test, time is reduced by about 2%.


And re-test page_fault1(anon) from will-it-scale

1) PCP enabled
tasks   processes   processes_idle   threads   threads_idle   linear
0       0           100              0         100            0
1       1416915     98.95            1418128   98.95          1418128
20      5327312     79.22            3821312   94.36          28362560
40      9437184     58.58            4463657   94.55          56725120
60      8120003     38.16            4736716   94.61          85087680
80      7356508     18.29            4847824   94.46          113450240
100     7256185     1.48             4870096   94.61          141812800

2) PCP disabled
tasks   processes   processes_idle   threads   threads_idle   linear
0       0           100              0         100            0
1       1365398     98.95            1354502   98.95          1365398
20      5174918     79.22            3722368   94.65          27307960
40      9094265     58.58            4427267   94.82          54615920
60      8021606     38.18            4572896   94.93          81923880
80      7497318     18.2             4637062   94.76          109231840
100     6819897     1.47             4654521   94.63          136539800

------------------------------------
1) vs 2) pcp enabled improve 3.86%

3) PCP re-enabled
tasks   processes   processes_idle   threads   threads_idle   linear
0       0           100              0         100            0
1       1419036     98.96            1428403   98.95          1428403
20      5356092     79.23            3851849   94.41          28568060
40      9437184     58.58            4512918   94.63          57136120
60      8252342     38.16            4659552   94.68          85704180
80      7414899     18.26            4790576   94.77          114272240
100     7062902     1.46             4759030   94.64          142840300

4) PCP re-disabled
tasks   processes   processes_idle   threads   threads_idle   linear
0       0           100              0         100            0
1       1352649     98.95            1354806   98.95          1354806
20      5172924     79.22            3719292   94.64          27096120
40      9174505     58.59            4310649   94.93          54192240
60      8021606     38.17            4552960   94.81          81288360
80      7497318     18.18            4671638   94.81          108384480
100     6823926     1.47             4725955   94.64          135480600

------------------------------------
3) vs 4) pcp enabled improve 5.43%

Average: 4.645%

>>
>>>
>>> How much is the cost vs. benefit of just having one sane system
>>> configuration?
>>>
>>
>> For arm64 with 4k, five more high-orders(4~8), five more pcplists,
>> and for high-orders, we assumes most of them are moveable, but maybe
>> not, so enable it by default maybe more fragmentization, see
>> 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized
>> allocations").
>>
On 2024/4/16 12:50, Kefeng Wang wrote: > > > On 2024/4/16 8:21, Barry Song wrote: >> On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang >> <wangkefeng.wang@huawei.com> wrote: >>> >>> >>> >>> On 2024/4/15 18:52, David Hildenbrand wrote: >>>> On 15.04.24 10:59, Kefeng Wang wrote: >>>>> >>>>> >>>>> On 2024/4/15 16:18, Barry Song wrote: >>>>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang >>>>>> <wangkefeng.wang@huawei.com> wrote: >>>>>>> >>>>>>> Both the file pages and anonymous pages support large folio, >>>>>>> high-order >>>>>>> pages except PMD_ORDER will also be allocated frequently which could >>>>>>> increase the zone lock contention, allow high-order pages on pcp >>>>>>> lists >>>>>>> could reduce the big zone lock contention, but as commit >>>>>>> 44042b449872 >>>>>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu >>>>>>> lists") >>>>>>> pointed, it may not win in all the scenes, add a new control >>>>>>> sysfs to >>>>>>> enable or disable specified high-order pages stored on PCP lists, >>>>>>> the order >>>>>>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP list by >>>>>>> default. >>>>>> >>>>>> This is precisely something Baolin and I have discussed and intended >>>>>> to implement[1], >>>>>> but unfortunately, we haven't had the time to do so. >>>>> >>>>> Indeed, same thing. Recently, we are working on unixbench/lmbench >>>>> optimization, I tested Multi-size THP for anonymous memory by >>>>> hard-cord >>>>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1], it shows some improvement but >>>>> not for all cases and not very stable, so re-implemented it by >>>>> according >>>>> to the user requirement and enable it dynamically. >>>> >>>> I'm wondering, though, if this is really a suitable candidate for a >>>> sysctl toggle. Can anybody really come up with an educated guess for >>>> these values? >>> >>> Not sure this is suitable in sysctl, but mTHP anon is enabled in sysctl, >>> we could trace __alloc_pages() and do order statistic to decide to >>> choose the high-order to be enabled on PCP. >>> >>>> >>>> Especially reading "Benchmarks Score shows a little improvoment(0.28%)" >>>> and "it may not win in all the scenes", to me it mostly sounds like >>>> "minimal impact" -- so who cares? >>> >>> Even though lock conflicts are eliminated, there is very limited >>> performance improvement(even maybe fluctuation), it is not a good >>> testcase to show improvement, just show the zone-lock issue, we need to >>> find other better testcase, maybe some test on Andriod(heavy use 64K, no >>> PMD THP), or LKP maybe give some help? >>> >>> I will try to find other testcase to show the benefit. 
>> >> Hi Kefeng, >> >> I wonder if you will see some major improvements on mTHP 64KiB using >> the below microbench I wrote just now, for example perf and time to >> finish the program >> >> #define DATA_SIZE (2UL * 1024 * 1024) >> >> int main(int argc, char **argv) >> { >> /* make 32 concurrent alloc and free of mTHP */ >> fork(); fork(); fork(); fork(); fork(); >> >> for (int i = 0; i < 100000; i++) { >> void *addr = mmap(NULL, DATA_SIZE, PROT_READ | >> PROT_WRITE, >> MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); >> if (addr == MAP_FAILED) { >> perror("fail to malloc"); >> return -1; >> } >> memset(addr, 0x11, DATA_SIZE); >> munmap(addr, DATA_SIZE); >> } >> >> return 0; >> } >> Rebased on next-20240415, echo never > /sys/kernel/mm/transparent_hugepage/enabled echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled Compare with echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled > > 1) PCP disabled > 1 2 3 4 5 average > real 200.41 202.18 203.16 201.54 200.91 201.64 > user 6.49 6.21 6.25 6.31 6.35 6.322 > sys 193.3 195.39 196.3 194.65 194.01 194.73 > > 2) PCP enabled > real 198.25 199.26 195.51 199.28 189.12 196.284 > -2.66% > user 6.21 6.02 6.02 6.28 6.21 6.148 -2.75% > sys 191.46 192.64 188.96 192.47 182.39 189.584 > -2.64% > > for above test, time reduce 2.x% > > > And re-test page_fault1(anon) from will-it-scale > > 1) PCP enabled > tasks processes processes_idle threads threads_idle linear > 0 0 100 0 100 0 > 1 1416915 98.95 1418128 98.95 1418128 > 20 5327312 79.22 3821312 94.36 28362560 > 40 9437184 58.58 4463657 94.55 56725120 > 60 8120003 38.16 4736716 94.61 85087680 > 80 7356508 18.29 4847824 94.46 113450240 > 100 7256185 1.48 4870096 94.61 141812800 > > 2) PCP disabled > tasks processes processes_idle threads threads_idle linear > 0 0 100 0 100 0 > 1 1365398 98.95 1354502 98.95 1365398 > 20 5174918 79.22 3722368 94.65 27307960 > 40 9094265 58.58 4427267 94.82 54615920 > 60 8021606 38.18 4572896 94.93 81923880 > 80 7497318 18.2 4637062 94.76 109231840 > 100 6819897 1.47 4654521 94.63 136539800 > > ------------------------------------ > 1) vs 2) pcp enabled improve 3.86% > > 3) PCP re-enabled > tasks processes processes_idle threads threads_idle linear > 0 0 100 0 100 0 > 1 1419036 98.96 1428403 98.95 1428403 > 20 5356092 79.23 3851849 94.41 28568060 > 40 9437184 58.58 4512918 94.63 57136120 > 60 8252342 38.16 4659552 94.68 85704180 > 80 7414899 18.26 4790576 94.77 114272240 > 100 7062902 1.46 4759030 94.64 142840300 > > 4) PCP re-disabled > tasks processes processes_idle threads threads_idle linear > 0 0 100 0 100 0 > 1 1352649 98.95 1354806 98.95 1354806 > 20 5172924 79.22 3719292 94.64 27096120 > 40 9174505 58.59 4310649 94.93 54192240 > 60 8021606 38.17 4552960 94.81 81288360 > 80 7497318 18.18 4671638 94.81 108384480 > 100 6823926 1.47 4725955 94.64 135480600 > > ------------------------------------ > 3) vs 4) pcp enabled improve 5.43% > > Average: 4.645% > > > > > >>> >>>> >>>> How much is the cost vs. benefit of just having one sane system >>>> configuration? >>>> >>> >>> For arm64 with 4k, five more high-orders(4~8), five more pcplists, >>> and for high-orders, we assumes most of them are moveable, but maybe >>> not, so enable it by default maybe more fragmentization, see >>> 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized >>> allocations"). >>> >
On Tue, Apr 16, 2024 at 4:58 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote: > > > > On 2024/4/16 12:50, Kefeng Wang wrote: > > > > > > On 2024/4/16 8:21, Barry Song wrote: > >> On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang > >> <wangkefeng.wang@huawei.com> wrote: > >>> > >>> > >>> > >>> On 2024/4/15 18:52, David Hildenbrand wrote: > >>>> On 15.04.24 10:59, Kefeng Wang wrote: > >>>>> > >>>>> > >>>>> On 2024/4/15 16:18, Barry Song wrote: > >>>>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang > >>>>>> <wangkefeng.wang@huawei.com> wrote: > >>>>>>> > >>>>>>> Both the file pages and anonymous pages support large folio, > >>>>>>> high-order > >>>>>>> pages except PMD_ORDER will also be allocated frequently which could > >>>>>>> increase the zone lock contention, allow high-order pages on pcp > >>>>>>> lists > >>>>>>> could reduce the big zone lock contention, but as commit > >>>>>>> 44042b449872 > >>>>>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu > >>>>>>> lists") > >>>>>>> pointed, it may not win in all the scenes, add a new control > >>>>>>> sysfs to > >>>>>>> enable or disable specified high-order pages stored on PCP lists, > >>>>>>> the order > >>>>>>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP list by > >>>>>>> default. > >>>>>> > >>>>>> This is precisely something Baolin and I have discussed and intended > >>>>>> to implement[1], > >>>>>> but unfortunately, we haven't had the time to do so. > >>>>> > >>>>> Indeed, same thing. Recently, we are working on unixbench/lmbench > >>>>> optimization, I tested Multi-size THP for anonymous memory by > >>>>> hard-cord > >>>>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1], it shows some improvement but > >>>>> not for all cases and not very stable, so re-implemented it by > >>>>> according > >>>>> to the user requirement and enable it dynamically. > >>>> > >>>> I'm wondering, though, if this is really a suitable candidate for a > >>>> sysctl toggle. Can anybody really come up with an educated guess for > >>>> these values? > >>> > >>> Not sure this is suitable in sysctl, but mTHP anon is enabled in sysctl, > >>> we could trace __alloc_pages() and do order statistic to decide to > >>> choose the high-order to be enabled on PCP. > >>> > >>>> > >>>> Especially reading "Benchmarks Score shows a little improvoment(0.28%)" > >>>> and "it may not win in all the scenes", to me it mostly sounds like > >>>> "minimal impact" -- so who cares? > >>> > >>> Even though lock conflicts are eliminated, there is very limited > >>> performance improvement(even maybe fluctuation), it is not a good > >>> testcase to show improvement, just show the zone-lock issue, we need to > >>> find other better testcase, maybe some test on Andriod(heavy use 64K, no > >>> PMD THP), or LKP maybe give some help? > >>> > >>> I will try to find other testcase to show the benefit. 
> >> > >> Hi Kefeng, > >> > >> I wonder if you will see some major improvements on mTHP 64KiB using > >> the below microbench I wrote just now, for example perf and time to > >> finish the program > >> > >> #define DATA_SIZE (2UL * 1024 * 1024) > >> > >> int main(int argc, char **argv) > >> { > >> /* make 32 concurrent alloc and free of mTHP */ > >> fork(); fork(); fork(); fork(); fork(); > >> > >> for (int i = 0; i < 100000; i++) { > >> void *addr = mmap(NULL, DATA_SIZE, PROT_READ | > >> PROT_WRITE, > >> MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); > >> if (addr == MAP_FAILED) { > >> perror("fail to malloc"); > >> return -1; > >> } > >> memset(addr, 0x11, DATA_SIZE); > >> munmap(addr, DATA_SIZE); > >> } > >> > >> return 0; > >> } > >> > > Rebased on next-20240415, > > echo never > /sys/kernel/mm/transparent_hugepage/enabled > echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled > > Compare with > echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled > echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled > > > > > 1) PCP disabled > > 1 2 3 4 5 average > > real 200.41 202.18 203.16 201.54 200.91 201.64 > > user 6.49 6.21 6.25 6.31 6.35 6.322 > > sys 193.3 195.39 196.3 194.65 194.01 194.73 > > > > 2) PCP enabled > > real 198.25 199.26 195.51 199.28 189.12 196.284 > > -2.66% > > user 6.21 6.02 6.02 6.28 6.21 6.148 -2.75% > > sys 191.46 192.64 188.96 192.47 182.39 189.584 > > -2.64% > > > > for above test, time reduce 2.x% This is an improvement from 0.28%, but it's still below my expectations. I suspect it's due to mTHP reducing the frequency of allocations and frees. Running the same test on order-0 might yield much better results. I suppose that as the order increases, PCP exhibits fewer improvements since both allocation and release activities decrease. Conversely, we also employ PCP for THP (2MB). Do we have any data demonstrating that such large-size allocations can benefit from PCP before ? 
> > > > > > And re-test page_fault1(anon) from will-it-scale > > > > 1) PCP enabled > > tasks processes processes_idle threads threads_idle linear > > 0 0 100 0 100 0 > > 1 1416915 98.95 1418128 98.95 1418128 > > 20 5327312 79.22 3821312 94.36 28362560 > > 40 9437184 58.58 4463657 94.55 56725120 > > 60 8120003 38.16 4736716 94.61 85087680 > > 80 7356508 18.29 4847824 94.46 113450240 > > 100 7256185 1.48 4870096 94.61 141812800 > > > > 2) PCP disabled > > tasks processes processes_idle threads threads_idle linear > > 0 0 100 0 100 0 > > 1 1365398 98.95 1354502 98.95 1365398 > > 20 5174918 79.22 3722368 94.65 27307960 > > 40 9094265 58.58 4427267 94.82 54615920 > > 60 8021606 38.18 4572896 94.93 81923880 > > 80 7497318 18.2 4637062 94.76 109231840 > > 100 6819897 1.47 4654521 94.63 136539800 > > > > ------------------------------------ > > 1) vs 2) pcp enabled improve 3.86% > > > > 3) PCP re-enabled > > tasks processes processes_idle threads threads_idle linear > > 0 0 100 0 100 0 > > 1 1419036 98.96 1428403 98.95 1428403 > > 20 5356092 79.23 3851849 94.41 28568060 > > 40 9437184 58.58 4512918 94.63 57136120 > > 60 8252342 38.16 4659552 94.68 85704180 > > 80 7414899 18.26 4790576 94.77 114272240 > > 100 7062902 1.46 4759030 94.64 142840300 > > > > 4) PCP re-disabled > > tasks processes processes_idle threads threads_idle linear > > 0 0 100 0 100 0 > > 1 1352649 98.95 1354806 98.95 1354806 > > 20 5172924 79.22 3719292 94.64 27096120 > > 40 9174505 58.59 4310649 94.93 54192240 > > 60 8021606 38.17 4552960 94.81 81288360 > > 80 7497318 18.18 4671638 94.81 108384480 > > 100 6823926 1.47 4725955 94.64 135480600 > > > > ------------------------------------ > > 3) vs 4) pcp enabled improve 5.43% > > > > Average: 4.645% > > > > > > > > > > > >>> > >>>> > >>>> How much is the cost vs. benefit of just having one sane system > >>>> configuration? > >>>> > >>> > >>> For arm64 with 4k, five more high-orders(4~8), five more pcplists, > >>> and for high-orders, we assumes most of them are moveable, but maybe > >>> not, so enable it by default maybe more fragmentization, see > >>> 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized > >>> allocations"). > >>> Thanks Barry
On 16.04.24 07:26, Barry Song wrote: > On Tue, Apr 16, 2024 at 4:58 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote: >> >> >> >> On 2024/4/16 12:50, Kefeng Wang wrote: >>> >>> >>> On 2024/4/16 8:21, Barry Song wrote: >>>> On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang >>>> <wangkefeng.wang@huawei.com> wrote: >>>>> >>>>> >>>>> >>>>> On 2024/4/15 18:52, David Hildenbrand wrote: >>>>>> On 15.04.24 10:59, Kefeng Wang wrote: >>>>>>> >>>>>>> >>>>>>> On 2024/4/15 16:18, Barry Song wrote: >>>>>>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang >>>>>>>> <wangkefeng.wang@huawei.com> wrote: >>>>>>>>> >>>>>>>>> Both the file pages and anonymous pages support large folio, >>>>>>>>> high-order >>>>>>>>> pages except PMD_ORDER will also be allocated frequently which could >>>>>>>>> increase the zone lock contention, allow high-order pages on pcp >>>>>>>>> lists >>>>>>>>> could reduce the big zone lock contention, but as commit >>>>>>>>> 44042b449872 >>>>>>>>> ("mm/page_alloc: allow high-order pages to be stored on the per-cpu >>>>>>>>> lists") >>>>>>>>> pointed, it may not win in all the scenes, add a new control >>>>>>>>> sysfs to >>>>>>>>> enable or disable specified high-order pages stored on PCP lists, >>>>>>>>> the order >>>>>>>>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP list by >>>>>>>>> default. >>>>>>>> >>>>>>>> This is precisely something Baolin and I have discussed and intended >>>>>>>> to implement[1], >>>>>>>> but unfortunately, we haven't had the time to do so. >>>>>>> >>>>>>> Indeed, same thing. Recently, we are working on unixbench/lmbench >>>>>>> optimization, I tested Multi-size THP for anonymous memory by >>>>>>> hard-cord >>>>>>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1], it shows some improvement but >>>>>>> not for all cases and not very stable, so re-implemented it by >>>>>>> according >>>>>>> to the user requirement and enable it dynamically. >>>>>> >>>>>> I'm wondering, though, if this is really a suitable candidate for a >>>>>> sysctl toggle. Can anybody really come up with an educated guess for >>>>>> these values? >>>>> >>>>> Not sure this is suitable in sysctl, but mTHP anon is enabled in sysctl, >>>>> we could trace __alloc_pages() and do order statistic to decide to >>>>> choose the high-order to be enabled on PCP. >>>>> >>>>>> >>>>>> Especially reading "Benchmarks Score shows a little improvoment(0.28%)" >>>>>> and "it may not win in all the scenes", to me it mostly sounds like >>>>>> "minimal impact" -- so who cares? >>>>> >>>>> Even though lock conflicts are eliminated, there is very limited >>>>> performance improvement(even maybe fluctuation), it is not a good >>>>> testcase to show improvement, just show the zone-lock issue, we need to >>>>> find other better testcase, maybe some test on Andriod(heavy use 64K, no >>>>> PMD THP), or LKP maybe give some help? >>>>> >>>>> I will try to find other testcase to show the benefit. 
>>>> >>>> Hi Kefeng, >>>> >>>> I wonder if you will see some major improvements on mTHP 64KiB using >>>> the below microbench I wrote just now, for example perf and time to >>>> finish the program >>>> >>>> #define DATA_SIZE (2UL * 1024 * 1024) >>>> >>>> int main(int argc, char **argv) >>>> { >>>> /* make 32 concurrent alloc and free of mTHP */ >>>> fork(); fork(); fork(); fork(); fork(); >>>> >>>> for (int i = 0; i < 100000; i++) { >>>> void *addr = mmap(NULL, DATA_SIZE, PROT_READ | >>>> PROT_WRITE, >>>> MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); >>>> if (addr == MAP_FAILED) { >>>> perror("fail to malloc"); >>>> return -1; >>>> } >>>> memset(addr, 0x11, DATA_SIZE); >>>> munmap(addr, DATA_SIZE); >>>> } >>>> >>>> return 0; >>>> } >>>> >> >> Rebased on next-20240415, >> >> echo never > /sys/kernel/mm/transparent_hugepage/enabled >> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled >> >> Compare with >> echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled >> echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled >> >>> >>> 1) PCP disabled >>> 1 2 3 4 5 average >>> real 200.41 202.18 203.16 201.54 200.91 201.64 >>> user 6.49 6.21 6.25 6.31 6.35 6.322 >>> sys 193.3 195.39 196.3 194.65 194.01 194.73 >>> >>> 2) PCP enabled >>> real 198.25 199.26 195.51 199.28 189.12 196.284 >>> -2.66% >>> user 6.21 6.02 6.02 6.28 6.21 6.148 -2.75% >>> sys 191.46 192.64 188.96 192.47 182.39 189.584 >>> -2.64% >>> >>> for above test, time reduce 2.x% > > This is an improvement from 0.28%, but it's still below my expectations. Yes, it's noise. Maybe we need a system with more Cores/Sockets? But it does feel a bit like we're trying to come up with the problem after we have a solution; I'd have thought some existing benchmark could highlight if that is worth it.
On 2024/4/16 15:03, David Hildenbrand wrote: > On 16.04.24 07:26, Barry Song wrote: >> On Tue, Apr 16, 2024 at 4:58 PM Kefeng Wang >> <wangkefeng.wang@huawei.com> wrote: >>> >>> >>> >>> On 2024/4/16 12:50, Kefeng Wang wrote: >>>> >>>> >>>> On 2024/4/16 8:21, Barry Song wrote: >>>>> On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang >>>>> <wangkefeng.wang@huawei.com> wrote: >>>>>> >>>>>> >>>>>> >>>>>> On 2024/4/15 18:52, David Hildenbrand wrote: >>>>>>> On 15.04.24 10:59, Kefeng Wang wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 2024/4/15 16:18, Barry Song wrote: >>>>>>>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang >>>>>>>>> <wangkefeng.wang@huawei.com> wrote: >>>>>>>>>> >>>>>>>>>> Both the file pages and anonymous pages support large folio, >>>>>>>>>> high-order >>>>>>>>>> pages except PMD_ORDER will also be allocated frequently which >>>>>>>>>> could >>>>>>>>>> increase the zone lock contention, allow high-order pages on pcp >>>>>>>>>> lists >>>>>>>>>> could reduce the big zone lock contention, but as commit >>>>>>>>>> 44042b449872 >>>>>>>>>> ("mm/page_alloc: allow high-order pages to be stored on the >>>>>>>>>> per-cpu >>>>>>>>>> lists") >>>>>>>>>> pointed, it may not win in all the scenes, add a new control >>>>>>>>>> sysfs to >>>>>>>>>> enable or disable specified high-order pages stored on PCP lists, >>>>>>>>>> the order >>>>>>>>>> (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on PCP >>>>>>>>>> list by >>>>>>>>>> default. >>>>>>>>> >>>>>>>>> This is precisely something Baolin and I have discussed and >>>>>>>>> intended >>>>>>>>> to implement[1], >>>>>>>>> but unfortunately, we haven't had the time to do so. >>>>>>>> >>>>>>>> Indeed, same thing. Recently, we are working on unixbench/lmbench >>>>>>>> optimization, I tested Multi-size THP for anonymous memory by >>>>>>>> hard-cord >>>>>>>> PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1], it shows some >>>>>>>> improvement but >>>>>>>> not for all cases and not very stable, so re-implemented it by >>>>>>>> according >>>>>>>> to the user requirement and enable it dynamically. >>>>>>> >>>>>>> I'm wondering, though, if this is really a suitable candidate for a >>>>>>> sysctl toggle. Can anybody really come up with an educated guess for >>>>>>> these values? >>>>>> >>>>>> Not sure this is suitable in sysctl, but mTHP anon is enabled in >>>>>> sysctl, >>>>>> we could trace __alloc_pages() and do order statistic to decide to >>>>>> choose the high-order to be enabled on PCP. >>>>>> >>>>>>> >>>>>>> Especially reading "Benchmarks Score shows a little >>>>>>> improvoment(0.28%)" >>>>>>> and "it may not win in all the scenes", to me it mostly sounds like >>>>>>> "minimal impact" -- so who cares? >>>>>> >>>>>> Even though lock conflicts are eliminated, there is very limited >>>>>> performance improvement(even maybe fluctuation), it is not a good >>>>>> testcase to show improvement, just show the zone-lock issue, we >>>>>> need to >>>>>> find other better testcase, maybe some test on Andriod(heavy use >>>>>> 64K, no >>>>>> PMD THP), or LKP maybe give some help? >>>>>> >>>>>> I will try to find other testcase to show the benefit. 
>>>>> >>>>> Hi Kefeng, >>>>> >>>>> I wonder if you will see some major improvements on mTHP 64KiB using >>>>> the below microbench I wrote just now, for example perf and time to >>>>> finish the program >>>>> >>>>> #define DATA_SIZE (2UL * 1024 * 1024) >>>>> >>>>> int main(int argc, char **argv) >>>>> { >>>>> /* make 32 concurrent alloc and free of mTHP */ >>>>> fork(); fork(); fork(); fork(); fork(); >>>>> >>>>> for (int i = 0; i < 100000; i++) { >>>>> void *addr = mmap(NULL, DATA_SIZE, PROT_READ | >>>>> PROT_WRITE, >>>>> MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); >>>>> if (addr == MAP_FAILED) { >>>>> perror("fail to malloc"); >>>>> return -1; >>>>> } >>>>> memset(addr, 0x11, DATA_SIZE); >>>>> munmap(addr, DATA_SIZE); >>>>> } >>>>> >>>>> return 0; >>>>> } >>>>> >>> >>> Rebased on next-20240415, >>> >>> echo never > /sys/kernel/mm/transparent_hugepage/enabled >>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled >>> >>> Compare with >>> echo 0 > >>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled >>> echo 1 > >>> /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled >>> >>>> >>>> 1) PCP disabled >>>> 1 2 3 4 5 average >>>> real 200.41 202.18 203.16 201.54 200.91 201.64 >>>> user 6.49 6.21 6.25 6.31 6.35 6.322 >>>> sys 193.3 195.39 196.3 194.65 194.01 194.73 >>>> >>>> 2) PCP enabled >>>> real 198.25 199.26 195.51 199.28 189.12 196.284 >>>> -2.66% >>>> user 6.21 6.02 6.02 6.28 6.21 6.148 -2.75% >>>> sys 191.46 192.64 188.96 192.47 182.39 189.584 >>>> -2.64% >>>> >>>> for above test, time reduce 2.x% >> >> This is an improvement from 0.28%, but it's still below my expectations. > > Yes, it's noise. Maybe we need a system with more Cores/Sockets? But it > does feel a bit like we're trying to come up with the problem after we > have a solution; I'd have thought some existing benchmark could > highlight if that is worth it. 96 core, with 129 threads, a quick test with pcp_enabled to control hugepages-2048KB, it is no big improvement on 2M PCP enabled 1 2 3 average real 221.8 225.6 221.5 222.9666667 user 14.91 14.91 17.05 15.62333333 sys 141.91 159.25 156.23 152.4633333 PCP disabled real 230.76 231.39 228.39 230.18 user 15.47 15.88 17.5 16.28333333 sys 159.07 162.32 159.09 160.16 From 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists"), it seems limited improve, netperf-udp 5.13.0-rc2 5.13.0-rc2 mm-pcpburst-v3r4 mm-pcphighorder-v1r7 Hmean send-64 261.46 ( 0.00%) 266.30 * 1.85%* Hmean send-128 516.35 ( 0.00%) 536.78 * 3.96%* Hmean send-256 1014.13 ( 0.00%) 1034.63 * 2.02%* Hmean send-1024 3907.65 ( 0.00%) 4046.11 * 3.54%* Hmean send-2048 7492.93 ( 0.00%) 7754.85 * 3.50%* Hmean send-3312 11410.04 ( 0.00%) 11772.32 * 3.18%* Hmean send-4096 13521.95 ( 0.00%) 13912.34 * 2.89%* Hmean send-8192 21660.50 ( 0.00%) 22730.72 * 4.94%* Hmean send-16384 31902.32 ( 0.00%) 32637.50 * 2.30%*