Message ID: 20230403052233.1880567-1-ankur.a.arora@oracle.com
Series:     x86/clear_huge_page: multi-page clearing
On 4/3/2023 10:52 AM, Ankur Arora wrote:
> This series introduces multi-page clearing for hugepages.
>
> This is a follow up of some of the ideas discussed at:
> https://lore.kernel.org/lkml/CAHk-=wj9En-BC4t7J9xFZOws5ShwaR9yor7FxHZr8CTVyEP_+Q@mail.gmail.com/
>
> On x86, page clearing is typically done via string instructions. These,
> unlike a MOV loop, allow us to explicitly advertise the region-size to
> the processor, which could serve as a hint to current (and/or future)
> uarchs to elide cacheline allocation.
>
> In current generation processors, Milan (and presumably other Zen
> variants) uses the hint to elide cacheline allocation (for
> region-size > LLC-size.)
>
> An additional reason for doing this is that string instructions are
> typically microcoded, and clearing in bigger chunks than the current
> page-at-a-time logic amortizes some of that cost.
>
> All uarchs tested (Milan, Icelakex, Skylakex) showed improved performance.
>
> There are, however, some problems:
>
> 1. Extended zeroing periods mean increased latency due to the now
>    missing preemption points.
>
>    That's handled in patches 7, 8, 9:
>      "sched: define TIF_ALLOW_RESCHED"
>      "irqentry: define irqentry_exit_allow_resched()"
>      "x86/clear_huge_page: make clear_contig_region() preemptible"
>    by the context marking itself reschedulable, and rescheduling in
>    irqexit context if needed (for PREEMPTION_NONE/_VOLUNTARY.)
>
> 2. The current page-at-a-time clearing logic does left-right narrowing
>    towards the faulting page, which maintains cache locality for
>    workloads that have a sequential access pattern. Clearing in large
>    chunks loses that.
>
>    Some (but not all) of that could be ameliorated by something like
>    this patch:
>    https://lore.kernel.org/lkml/20220606203725.1313715-1-ankur.a.arora@oracle.com/
>
>    But, before doing that, I'd like some comments on whether that is
>    worth doing for this specific use case.
>
> Rest of the series:
>
> Patches 1, 2, 3:
>   "huge_pages: get rid of process_huge_page()"
>   "huge_page: get rid of {clear,copy}_subpage()"
>   "huge_page: allow arch override for clear/copy_huge_page()"
> are mechanical and simplify some of the current clear_huge_page() logic.
>
> Patches 4, 5:
>   "x86/clear_page: parameterize clear_page*() to specify length"
>   "x86/clear_pages: add clear_pages()"
> add clear_pages() and helpers.
>
> Patch 6: "mm/clear_huge_page: use multi-page clearing" adds the
> chunked x86 clear_huge_page() implementation.
>
>
> Performance
> ==
>
> Demand fault performance gets a decent boost:
>
>  *Icelakex*   mm/clear_huge_page   x86/clear_huge_page   change
>                     (GB/s)               (GB/s)
>
>  pg-sz=2MB           8.76                11.82           +34.93%
>  pg-sz=1GB           8.99                12.18           +35.48%
>
>
>  *Milan*      mm/clear_huge_page   x86/clear_huge_page   change
>                     (GB/s)               (GB/s)
>
>  pg-sz=2MB          12.24                17.54           +43.30%
>  pg-sz=1GB          17.98                37.24          +107.11%
>
>
> vm-scalability/case-anon-w-seq-hugetlb gains in stime but performs
> worse when user space tries to touch those pages:
>
>  *Icelakex*                 mm/clear_huge_page   x86/clear_huge_page   change
>  (mem=4GB/task, tasks=128)
>
>  stime                        293.02 +- .49%       239.39 +- .83%     -18.30%
>  utime                        440.11 +- .28%       508.74 +- .60%     +15.59%
>  wall-clock                     5.96 +- .33%         6.27 +-2.23%     + 5.20%
>
>
>  *Milan*                    mm/clear_huge_page   x86/clear_huge_page   change
>  (mem=1GB/task, tasks=512)
>
>  stime                        490.95 +- 3.55%      466.90 +- 4.79%    - 4.89%
>  utime                        276.43 +- 2.85%      311.97 +- 5.15%    +12.85%
>  wall-clock                     3.74 +- 6.41%        3.58 +- 7.82%    - 4.27%
>
> Also at:
>   github.com/terminus/linux clear-pages.v1
>
> Comments appreciated!
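To make the above concrete: the primitive the series adds in patches 4 and 5
is, at its core, a length-parameterized string-instruction clear, so the
processor sees the full extent of the region up front instead of one 4K page
at a time. A minimal sketch, assuming a REP STOSB-based implementation
(illustrative only; this is not the clear_pages() code from the series, and
the preemption-latency handling lives separately in the TIF_ALLOW_RESCHED
patches):

#include <linux/mm.h>	/* PAGE_SIZE */

/*
 * Sketch of a multi-page clearing primitive: a single REP STOSB over a
 * physically contiguous extent, rather than a page-at-a-time loop, so
 * the region size is visible to the CPU when the clear starts.
 */
static inline void clear_pages_sketch(void *addr, unsigned long npages)
{
	unsigned long len = npages * PAGE_SIZE;

	asm volatile("rep stosb"	/* store AL (0) to [RDI], RCX times */
		     : "+D" (addr), "+c" (len)
		     : "a" (0)
		     : "memory");
}

On parts that honour the size hint (Milan, per the cover letter), a region
larger than the LLC can then be cleared without allocating cachelines for it.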
Hello Ankur,

Was able to test your patches. To summarize, am seeing 2x-3x perf
improvement for 2M, 1GB base hugepage sizes.

SUT: Genoa AMD EPYC
Thread(s) per core:  2
Core(s) per socket:  128
Socket(s):           2

NUMA:
NUMA node(s):        2
NUMA node0 CPU(s):   0-127,256-383
NUMA node1 CPU(s):   128-255,384-511

Test: Use mmap(MAP_HUGETLB) to demand a fault on 64GB region (NUMA node0),
for both base-hugepage-size=2M and 1GB

perf stat -r 10 -d -d numactl -m 0 -N 0 <test>

time in seconds elapsed (average of 10 runs) (lower = better)

Result:
page-size   mm/clear_huge_page   x86/clear_huge_page   change %
2M               5.4567                2.6774            -50.93
1G               2.64452               1.011281          -61.76

Full perf stat info:

page size = 2M, mm/clear_huge_page

 Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_2M' (10 runs):

          5,434.71 msec task-clock                #  0.996 CPUs utilized              ( +-  0.55% )
                 8      context-switches          #  1.466 /sec                       ( +-  4.66% )
                 0      cpu-migrations            #  0.000 /sec
            32,918      page-faults               #  6.034 K/sec                      ( +-  0.00% )
    16,977,242,482      cycles                    #  3.112 GHz                        ( +-  0.04% )  (35.70%)
         1,961,724      stalled-cycles-frontend   #  0.01% frontend cycles idle       ( +-  1.09% )  (35.72%)
        35,685,674      stalled-cycles-backend    #  0.21% backend cycles idle        ( +-  3.48% )  (35.74%)
     1,038,327,182      instructions              #  0.06 insn per cycle
                                                  #  0.04 stalled cycles per insn     ( +-  0.38% )  (35.75%)
       221,409,216      branches                  #  40.584 M/sec                     ( +-  0.36% )  (35.75%)
           350,730      branch-misses             #  0.16% of all branches            ( +-  1.18% )  (35.75%)
     2,520,888,779      L1-dcache-loads           #  462.077 M/sec                    ( +-  0.03% )  (35.73%)
     1,094,178,209      L1-dcache-load-misses     #  43.46% of all L1-dcache accesses ( +-  0.02% )  (35.71%)
        67,751,730      L1-icache-loads           #  12.419 M/sec                     ( +-  0.11% )  (35.70%)
           271,118      L1-icache-load-misses     #  0.40% of all L1-icache accesses  ( +-  2.55% )  (35.70%)
           506,635      dTLB-loads                #  92.866 K/sec                     ( +-  3.31% )  (35.70%)
           237,385      dTLB-load-misses          #  43.64% of all dTLB cache accesses ( +-  7.00% )  (35.69%)
               268      iTLB-load-misses          #  6700.00% of all iTLB cache accesses ( +- 13.86% )  (35.70%)

            5.4567 +- 0.0300 seconds time elapsed  ( +-  0.55% )

page size = 2M, x86/clear_huge_page

 Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_2M' (10 runs):

          2,780.69 msec task-clock                #  1.039 CPUs utilized              ( +-  1.03% )
                 3      context-switches          #  1.121 /sec                       ( +- 21.34% )
                 0      cpu-migrations            #  0.000 /sec
            32,918      page-faults               #  12.301 K/sec                     ( +-  0.00% )
     8,143,619,771      cycles                    #  3.043 GHz                        ( +-  0.25% )  (35.62%)
         2,024,872      stalled-cycles-frontend   #  0.02% frontend cycles idle       ( +-320.93% )  (35.66%)
       717,198,728      stalled-cycles-backend    #  8.82% backend cycles idle        ( +-  8.26% )  (35.69%)
       606,549,334      instructions              #  0.07 insn per cycle
                                                  #  1.39 stalled cycles per insn     ( +-  0.23% )  (35.73%)
       108,856,550      branches                  #  40.677 M/sec                     ( +-  0.24% )  (35.76%)
           202,490      branch-misses             #  0.18% of all branches            ( +-  3.58% )  (35.78%)
     2,348,818,806      L1-dcache-loads           #  877.701 M/sec                    ( +-  0.03% )  (35.78%)
     1,081,562,988      L1-dcache-load-misses     #  46.04% of all L1-dcache accesses ( +-  0.01% )  (35.78%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
        43,411,167      L1-icache-loads           #  16.222 M/sec                     ( +-  0.19% )  (35.77%)
           273,042      L1-icache-load-misses     #  0.64% of all L1-icache accesses  ( +-  4.94% )  (35.76%)
           834,482      dTLB-loads                #  311.827 K/sec                    ( +-  9.73% )  (35.72%)
           437,343      dTLB-load-misses          #  65.86% of all dTLB cache accesses ( +-  8.56% )  (35.68%)
                 0      iTLB-loads                #  0.000 /sec                       (35.65%)
               160      iTLB-load-misses          #  1777.78% of all iTLB cache accesses ( +- 15.82% )  (35.62%)

            2.6774 +- 0.0287 seconds time elapsed  ( +-  1.07% )

page size = 1G, mm/clear_huge_page

 Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):

          2,625.24 msec task-clock                #  0.993 CPUs utilized              ( +-  0.23% )
                 4      context-switches          #  1.513 /sec                       ( +-  4.49% )
                 1      cpu-migrations            #  0.378 /sec
               214      page-faults               #  80.965 /sec                      ( +-  0.13% )
     8,178,624,349      cycles                    #  3.094 GHz                        ( +-  0.23% )  (35.65%)
         2,942,576      stalled-cycles-frontend   #  0.04% frontend cycles idle       ( +- 75.22% )  (35.69%)
         7,117,425      stalled-cycles-backend    #  0.09% backend cycles idle        ( +-  3.79% )  (35.73%)
       454,521,647      instructions              #  0.06 insn per cycle
                                                  #  0.02 stalled cycles per insn     ( +-  0.10% )  (35.77%)
       113,223,853      branches                  #  42.837 M/sec                     ( +-  0.08% )  (35.80%)
            84,766      branch-misses             #  0.07% of all branches            ( +-  5.37% )  (35.80%)
     2,294,528,890      L1-dcache-loads           #  868.111 M/sec                    ( +-  0.02% )  (35.81%)
     1,075,907,551      L1-dcache-load-misses     #  46.88% of all L1-dcache accesses ( +-  0.02% )  (35.78%)
        26,167,323      L1-icache-loads           #  9.900 M/sec                      ( +-  0.24% )  (35.74%)
           139,675      L1-icache-load-misses     #  0.54% of all L1-icache accesses  ( +-  0.37% )  (35.70%)
             3,459      dTLB-loads                #  1.309 K/sec                      ( +- 12.75% )  (35.67%)
               732      dTLB-load-misses          #  19.71% of all dTLB cache accesses ( +- 26.61% )  (35.62%)
                11      iTLB-load-misses          #  192.98% of all iTLB cache accesses ( +-238.28% )  (35.62%)

           2.64452 +- 0.00600 seconds time elapsed  ( +-  0.23% )

page size = 1G, x86/clear_huge_page

 Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):

          1,009.09 msec task-clock                #  0.998 CPUs utilized              ( +-  0.06% )
                 2      context-switches          #  1.980 /sec                       ( +- 23.63% )
                 1      cpu-migrations            #  0.990 /sec
               214      page-faults               #  211.887 /sec                     ( +-  0.16% )
     3,154,980,463      cycles                    #  3.124 GHz                        ( +-  0.06% )  (35.77%)
           145,051      stalled-cycles-frontend   #  0.00% frontend cycles idle       ( +-  6.26% )  (35.78%)
       730,087,143      stalled-cycles-backend    #  23.12% backend cycles idle       ( +-  9.75% )  (35.78%)
        45,813,391      instructions              #  0.01 insn per cycle
                                                  #  18.51 stalled cycles per insn    ( +-  1.00% )  (35.78%)
         8,498,282      branches                  #  8.414 M/sec                      ( +-  1.54% )  (35.78%)
            63,351      branch-misses             #  0.74% of all branches            ( +-  6.70% )  (35.69%)
        29,135,863      L1-dcache-loads           #  28.848 M/sec                     ( +-  5.67% )  (35.68%)
         8,537,280      L1-dcache-load-misses     #  28.66% of all L1-dcache accesses ( +- 10.15% )  (35.68%)
         1,040,087      L1-icache-loads           #  1.030 M/sec                      ( +-  1.60% )  (35.68%)
             9,147      L1-icache-load-misses     #  0.85% of all L1-icache accesses  ( +-  6.50% )  (35.67%)
             1,084      dTLB-loads                #  1.073 K/sec                      ( +- 12.05% )  (35.68%)
               431      dTLB-load-misses          #  40.28% of all dTLB cache accesses ( +- 43.46% )  (35.68%)
                16      iTLB-load-misses          #  0.00% of all iTLB cache accesses ( +- 40.54% )  (35.68%)

          1.011281 +- 0.000624 seconds time elapsed  ( +-  0.06% )

Please feel free to add

Tested-by: Raghavendra K T <raghavendra.kt@amd.com>

Will come back with further observations on patch/performance if any.

Thanks and Regards
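A demand-fault test of the sort described above presumably looks something
like the following minimal sketch (the actual map_hugetlb_2M/map_hugetlb_1G
sources were not posted; the region size, mmap flags, and touch loop here are
assumptions based on the description):

#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT	26
#endif
#ifndef MAP_HUGE_2MB
#define MAP_HUGE_2MB	(21 << MAP_HUGE_SHIFT)	/* log2(2MB) encoded in the mmap flags */
#endif

#define REGION_SZ	(64UL << 30)	/* 64GB test region */
#define HPAGE_SZ	(2UL << 20)	/* 2MB; use (1UL << 30) and MAP_HUGE_1GB for the 1GB variant */

int main(void)
{
	unsigned long off;
	char *p = mmap(NULL, REGION_SZ, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
		       -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* One store per hugepage => one demand fault => one hugepage cleared. */
	for (off = 0; off < REGION_SZ; off += HPAGE_SZ)
		p[off] = 1;

	munmap(p, REGION_SZ);
	return 0;
}

Run under the same harness as the report above, e.g.
"perf stat -r 10 -d -d numactl -m 0 -N 0 ./map_hugetlb_2M", with enough
hugepages reserved beforehand (e.g. via /proc/sys/vm/nr_hugepages for the 2MB
pool). The ~32k page-faults reported for the 2M runs match the 32768
hugepages such a 64GB region needs, so the fault path (and hence
clear_huge_page()) dominates the runtime.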
Raghavendra K T <raghavendra.kt@amd.com> writes:

> On 4/3/2023 10:52 AM, Ankur Arora wrote:
>> This series introduces multi-page clearing for hugepages.

>  *Milan*      mm/clear_huge_page   x86/clear_huge_page   change
>                     (GB/s)               (GB/s)
>  pg-sz=2MB          12.24                17.54           +43.30%
>  pg-sz=1GB          17.98                37.24          +107.11%
>
> Hello Ankur,
>
> Was able to test your patches. To summarize, am seeing 2x-3x perf
> improvement for 2M, 1GB base hugepage sizes.

Great. Thanks Raghavendra.

> SUT: Genoa AMD EPYC
> Thread(s) per core:  2
> Core(s) per socket:  128
> Socket(s):           2
>
> NUMA:
> NUMA node(s):        2
> NUMA node0 CPU(s):   0-127,256-383
> NUMA node1 CPU(s):   128-255,384-511
>
> Test: Use mmap(MAP_HUGETLB) to demand a fault on 64GB region (NUMA node0),
> for both base-hugepage-size=2M and 1GB
>
> perf stat -r 10 -d -d numactl -m 0 -N 0 <test>
>
> time in seconds elapsed (average of 10 runs) (lower = better)
>
> Result:
> page-size   mm/clear_huge_page   x86/clear_huge_page
> 2M               5.4567                2.6774
> 1G               2.64452               1.011281

So translating into BW, for Genoa we have:

 page-size   mm/clear_huge_page   x86/clear_huge_page
                   (GB/s)               (GB/s)
 2M                11.74                23.97
 1G                24.24                63.36

That's a pretty good bump over Milan:

>  *Milan*      mm/clear_huge_page   x86/clear_huge_page
>                     (GB/s)               (GB/s)
>  pg-sz=2MB          12.24                17.54
>  pg-sz=1GB          17.98                37.24

Btw, are these numbers with boost=1?

[ snipped: full quoted perf stat output; see the previous message for the numbers ]

> Please feel free to add
>
> Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Thanks
Ankur

> Will come back with further observations on patch/performance if any
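(For reference, the GB/s figures above appear to be simply region-size over
elapsed time for the 64GB test region: 64 GB / 5.4567 s ≈ 11.7 GB/s for the
unpatched 2M case, and 64 GB / 1.011 s ≈ 63.3 GB/s for the patched 1GB case,
which roughly lines up with the table in Ankur's reply.)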
On 4/9/2023 4:16 AM, Ankur Arora wrote:
>
> Raghavendra K T <raghavendra.kt@amd.com> writes:
>
>> On 4/3/2023 10:52 AM, Ankur Arora wrote:
>>> This series introduces multi-page clearing for hugepages.
>
>>  *Milan*      mm/clear_huge_page   x86/clear_huge_page   change
>>                     (GB/s)               (GB/s)
>>  pg-sz=2MB          12.24                17.54           +43.30%
>>  pg-sz=1GB          17.98                37.24          +107.11%
>>
>> Hello Ankur,
>>
>> Was able to test your patches. To summarize, am seeing 2x-3x perf
>> improvement for 2M, 1GB base hugepage sizes.
>
> Great. Thanks Raghavendra.
>
>> SUT: Genoa AMD EPYC
>> Thread(s) per core:  2
>> Core(s) per socket:  128
>> Socket(s):           2
>>
>> NUMA:
>> NUMA node(s):        2
>> NUMA node0 CPU(s):   0-127,256-383
>> NUMA node1 CPU(s):   128-255,384-511
>>
>> Test: Use mmap(MAP_HUGETLB) to demand a fault on 64GB region (NUMA node0),
>> for both base-hugepage-size=2M and 1GB
>>
>> perf stat -r 10 -d -d numactl -m 0 -N 0 <test>
>>
>> time in seconds elapsed (average of 10 runs) (lower = better)
>>
>> Result:
>> page-size   mm/clear_huge_page   x86/clear_huge_page
>> 2M               5.4567                2.6774
>> 1G               2.64452               1.011281
>
> So translating into BW, for Genoa we have:
>
>  page-size   mm/clear_huge_page   x86/clear_huge_page
>                    (GB/s)               (GB/s)
>  2M                11.74                23.97
>  1G                24.24                63.36
>
> That's a pretty good bump over Milan:
>
>>  *Milan*      mm/clear_huge_page   x86/clear_huge_page
>>                     (GB/s)               (GB/s)
>>  pg-sz=2MB          12.24                17.54
>>  pg-sz=1GB          17.98                37.24
>
> Btw, are these numbers with boost=1?

Yes, it is. Also, a note about the config: I had not enabled the
GCOV/LOCKSTAT related config options because I faced some issues with them.