Message ID: 20230127194110.533103-1-surenb@google.com (mailing list archive)
Series: Per-VMA locks
On Fri, 27 Jan 2023 11:40:37 -0800 Suren Baghdasaryan <surenb@google.com> wrote:

> The per-VMA locks idea was discussed during the SPF [1] discussion at
> LSF/MM last year [2], which concluded with the suggestion that “a
> reader/writer semaphore could be put into the VMA itself; that would
> have the effect of using the VMA as a sort of range lock. There would
> still be contention at the VMA level, but it would be an improvement.”
> This patchset implements this suggested approach.

I think I'll await reviewer/tester input for a while.

> The patchset implements per-VMA locking only for anonymous pages which
> are not in swap and avoids userfaultfd as their implementation is more
> complex. Additional support for file-backed page faults, swapped and
> user pages can be added incrementally.

This is a significant risk. How can we be confident that these as yet
unimplemented parts are implementable and that the result will be good?
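To make the quoted design concrete, here is a minimal userspace analogue in C.
All of the names below (struct vma, handle_fault, modify_vma, ...) are invented
for this sketch and are not the patchset's code; in particular, the real
patchset looks the VMA up under RCU and revalidates it against concurrent
modification, which this toy lookup glosses over. The fallback branch also
illustrates why the anonymous-only restriction leaves other fault types
behaving exactly as today.

/*
 * Toy analogue of per-VMA locking: each VMA carries its own rwlock,
 * and the fault path falls back to the whole-mm mmap_lock whenever
 * the per-VMA fast path cannot be used.
 */
#include <pthread.h>
#include <stdbool.h>

struct vma {
	unsigned long start, end;
	bool anon;                  /* only anonymous VMAs use the fast path */
	pthread_rwlock_t lock;      /* the rwsem-in-the-VMA "range lock" */
};

struct mm {
	pthread_rwlock_t mmap_lock; /* the lock being contended today */
	struct vma *vmas;
	int nr_vmas;
};

static struct vma *find_vma(struct mm *mm, unsigned long addr)
{
	/* Unsynchronized walk; the real kernel does this under RCU. */
	for (int i = 0; i < mm->nr_vmas; i++)
		if (addr >= mm->vmas[i].start && addr < mm->vmas[i].end)
			return &mm->vmas[i];
	return NULL;
}

static void handle_fault(struct mm *mm, unsigned long addr)
{
	struct vma *v = find_vma(mm, addr);

	/* Fast path: read-lock only the faulting VMA. */
	if (v && v->anon && pthread_rwlock_tryrdlock(&v->lock) == 0) {
		/* ... fault in the anonymous page ... */
		pthread_rwlock_unlock(&v->lock);
		return;
	}

	/* Slow path: file/swap/uffd faults keep today's behaviour. */
	pthread_rwlock_rdlock(&mm->mmap_lock);
	/* ... fault in the page under the whole-mm lock ... */
	pthread_rwlock_unlock(&mm->mmap_lock);
}

/* Writers (mmap/munmap/mremap...) still take mmap_lock for writing and
 * also write-lock each VMA they touch, excluding the fast path. */
static void modify_vma(struct mm *mm, struct vma *v, unsigned long new_end)
{
	pthread_rwlock_wrlock(&mm->mmap_lock);
	pthread_rwlock_wrlock(&v->lock);
	v->end = new_end;
	pthread_rwlock_unlock(&v->lock);
	pthread_rwlock_unlock(&mm->mmap_lock);
}

int main(void)
{
	struct vma v = { .start = 0x1000, .end = 0x2000, .anon = true,
			 .lock = PTHREAD_RWLOCK_INITIALIZER };
	struct mm mm = { .mmap_lock = PTHREAD_RWLOCK_INITIALIZER,
			 .vmas = &v, .nr_vmas = 1 };

	handle_fault(&mm, 0x1800);   /* anonymous VMA: fast path */
	modify_vma(&mm, &v, 0x3000); /* writer: takes both locks */
	handle_fault(&mm, 0x2800);   /* now inside the grown VMA */
	return 0;
}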
On Fri, Jan 27, 2023 at 02:51:38PM -0800, Andrew Morton wrote:
> On Fri, 27 Jan 2023 11:40:37 -0800 Suren Baghdasaryan <surenb@google.com> wrote:
>
> > [...]
>
> I think I'll await reviewer/tester input for a while.
>
> > The patchset implements per-VMA locking only for anonymous pages which
> > are not in swap and avoids userfaultfd as their implementation is more
> > complex. Additional support for file-backed page faults, swapped and
> > user pages can be added incrementally.
>
> This is a significant risk. How can we be confident that these as yet
> unimplemented parts are implementable and that the result will be good?

They don't need to be implementable for this patchset to be evaluated
on its own terms. This patchset improves scalability for anon pages
without making file/swap/uffd pages worse (or if it does, I haven't
seen the benchmarks to prove it).

That said, I'm confident that I have a good handle on how to make
file-backed page faults work under RCU.
On Fri, Jan 27, 2023 at 3:26 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Fri, Jan 27, 2023 at 02:51:38PM -0800, Andrew Morton wrote:
> > [...]
> >
> > I think I'll await reviewer/tester input for a while.

Sure, I don't expect the review to be very quick considering the
complexity, but I would appreciate any testing that can be done.

> > This is a significant risk. How can we be confident that these as yet
> > unimplemented parts are implementable and that the result will be good?
>
> They don't need to be implementable for this patchset to be evaluated
> on its own terms. This patchset improves scalability for anon pages
> without making file/swap/uffd pages worse (or if it does, I haven't
> seen the benchmarks to prove it).

Making it work for all kinds of page faults would require much more
time, so this incremental approach, where we tackle the mmap_lock
scalability problem part by part, seems more doable. Even with
anonymous-only support, the patchset shows considerable improvements.
I would therefore argue that it is viable even though it does not yet
support the above-mentioned cases.

> That said, I'm confident that I have a good handle on how to make
> file-backed page faults work under RCU.

Looking forward to collaborating on that!
Thanks,
Suren
On Fri, Jan 27, 2023 at 4:00 PM Suren Baghdasaryan <surenb@google.com> wrote:
> On Fri, Jan 27, 2023 at 3:26 PM Matthew Wilcox <willy@infradead.org> wrote:
> > On Fri, Jan 27, 2023 at 02:51:38PM -0800, Andrew Morton wrote:
> > > [...]
> > >
> > > I think I'll await reviewer/tester input for a while.

Over the last two weeks I did not receive any feedback on the mailing
list, but off-list a couple of people reported positive results in
their tests, and Punit reported a regression on his NUMA machine when
running the pft-threads workload. I found the source of that regression
and have two small fixes which were confirmed to improve performance
(hopefully Punit will share the results here).

I'm planning to post v3 sometime this week. If anyone has additional
feedback, please let me know soon so that I can address it in v3.
Thanks,
Suren

> [... rest of the earlier exchange snipped ...]
Suren Baghdasaryan <surenb@google.com> writes:

> Previous version:
> v1: https://lore.kernel.org/all/20230109205336.3665937-1-surenb@google.com/
> RFC: https://lore.kernel.org/all/20220901173516.702122-1-surenb@google.com/
>
> LWN article describing the feature:
> https://lwn.net/Articles/906852/
>
> [...]

I took the patches for a spin on a 2-socket 32 core (64 threads) system
with Intel 8336C (Ice Lake) and 512GB of RAM.

For the initial testing, "pft-threads" from the mmtests suite[0] was
used. The test mmaps a memory region (~100GB on the test system) and
triggers access by a number of threads executing in parallel. For each
degree of parallelism, the test is repeated 10 times to get a better
feel for the behaviour. Below is an excerpt of the harmonic mean
reported by the 'compare-kernels' script[1] included with mmtests.

The first column is results for mm-unstable as of 2023-02-10, the
second column is the patches posted here, while the third column
includes optimizations to reclaim some of the observed regression.

From the results, there is a drop in page faults/second for low numbers
of CPUs but a good improvement with higher CPU counts.

                                   6.2.0-rc4              6.2.0-rc4              6.2.0-rc4
                        mm-unstable-20230210                 pvl-v2             pvl-v2+opt

Hmean     faults/cpu-1    898792.9338 (   0.00%)   894597.0474 *  -0.47%*   895933.2782 *  -0.32%*
Hmean     faults/cpu-4    751903.9803 (   0.00%)   677764.2975 *  -9.86%*   688643.8163 *  -8.41%*
Hmean     faults/cpu-7    612275.5663 (   0.00%)   565363.4137 *  -7.66%*   597538.9396 *  -2.41%*
Hmean     faults/cpu-12   434460.9074 (   0.00%)   410974.2708 *  -5.41%*   452501.4290 *   4.15%*
Hmean     faults/cpu-21   291475.5165 (   0.00%)   293936.8460 (   0.84%)   308712.2434 *   5.91%*
Hmean     faults/cpu-30   218021.3980 (   0.00%)   228265.0559 *   4.70%*   241897.5225 *  10.95%*
Hmean     faults/cpu-48   141798.5030 (   0.00%)   162322.5972 *  14.47%*   166081.9459 *  17.13%*
Hmean     faults/cpu-79    90060.9577 (   0.00%)   107028.7779 *  18.84%*   109810.4488 *  21.93%*
Hmean     faults/cpu-110   64729.3561 (   0.00%)    80597.7246 *  24.51%*    83134.0679 *  28.43%*
Hmean     faults/cpu-128   55740.1334 (   0.00%)    68395.4426 *  22.70%*    69248.2836 *  24.23%*

Hmean     faults/sec-1    898781.7694 (   0.00%)   894247.3174 *  -0.50%*   894440.3118 *  -0.48%*
Hmean     faults/sec-4   2965588.9697 (   0.00%)  2683651.5664 *  -9.51%*  2726450.9710 *  -8.06%*
Hmean     faults/sec-7   4144512.3996 (   0.00%)  3891644.2128 *  -6.10%*  4099918.8601 (  -1.08%)
Hmean     faults/sec-12  4969513.6934 (   0.00%)  4829731.4355 *  -2.81%*  5264682.7371 *   5.94%*
Hmean     faults/sec-21  5814379.4789 (   0.00%)  5941405.3116 *   2.18%*  6263716.3903 *   7.73%*
Hmean     faults/sec-30  6153685.3709 (   0.00%)  6489311.6634 *   5.45%*  6910843.5858 *  12.30%*
Hmean     faults/sec-48  6197953.1327 (   0.00%)  7216320.7727 *  16.43%*  7412782.2927 *  19.60%*
Hmean     faults/sec-79  6167135.3738 (   0.00%)  7425927.1022 *  20.41%*  7637042.2198 *  23.83%*
Hmean     faults/sec-110 6264768.2247 (   0.00%)  7813329.3863 *  24.72%*  7984344.4005 *  27.45%*
Hmean     faults/sec-128 6460727.8216 (   0.00%)  7875664.8999 *  21.90%*  8049910.3601 *  24.60%*

[0] https://github.com/gormanm/mmtests
[1] https://github.com/gormanm/mmtests/blob/master/compare-kernels.sh
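For readers without mmtests set up, the essence of the pft-threads workload
as described above can be sketched like this: one large anonymous mapping,
with N threads first-touching disjoint slices of it in parallel, so every
fault lands in the same VMA. This is a toy reimplementation for illustration,
not the actual pft source (pft additionally times the faults and reports the
faults/cpu and faults/sec figures shown in the table).

/* Toy pft-threads-style workload. Build with: cc -O2 -pthread pft_toy.c */
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define NTHREADS 8
static size_t region_sz = 1UL << 30;   /* 1 GiB here; pft used ~100 GB */
static char *region;
static long page_sz;

static void *toucher(void *arg)
{
	long id = (long)arg;
	size_t slice = region_sz / NTHREADS;
	char *p = region + id * slice;

	/* Writing one byte per page triggers an anonymous page fault. */
	for (size_t off = 0; off < slice; off += page_sz)
		p[off] = 1;
	return NULL;
}

int main(void)
{
	pthread_t tids[NTHREADS];

	page_sz = sysconf(_SC_PAGESIZE);
	region = mmap(NULL, region_sz, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&tids[i], NULL, toucher, (void *)i);
	for (long i = 0; i < NTHREADS; i++)
		pthread_join(tids[i], NULL);
	munmap(region, region_sz);
	return 0;
}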
On Wed, Feb 15, 2023 at 9:33 AM Punit Agrawal <punit.agrawal@bytedance.com> wrote:
>
> Suren Baghdasaryan <surenb@google.com> writes:
>
> > [...]
>
> I took the patches for a spin on a 2-socket 32 core (64 threads) system
> with Intel 8336C (Ice Lake) and 512GB of RAM.
>
> [... benchmark results snipped ...]
>
> From the results, there is a drop in page faults/second for low numbers
> of CPUs but a good improvement with higher CPU counts.

Thanks for summarizing the findings, Punit! So it looks like the latest
fixes I sent you for testing (pvl-v2+opt) bring the regression down
quite a bit. The faults/sec-4 case is still regressing, but the rest
look quite good. I'll incorporate those fixes and post v3 shortly.
Thanks!

> [0] https://github.com/gormanm/mmtests
> [1] https://github.com/gormanm/mmtests/blob/master/compare-kernels.sh
Punit Agrawal <punit.agrawal@bytedance.com> writes:

> Suren Baghdasaryan <surenb@google.com> writes:
>
> > [...]
>
> I took the patches for a spin on a 2-socket 32 core (64 threads) system
> with Intel 8336C (Ice Lake) and 512GB of RAM.
>
> [... benchmark results snipped ...]

The above workload represents the worst case with regard to per-VMA
locks as it creates a single large VMA. As a follow-up, I modified
pft[2] to create a VMA per thread to understand the behaviour in
scenarios where per-VMA locks should show the most benefit.

                                   6.2.0-rc4              6.2.0-rc4              6.2.0-rc4
                        mm-unstable-20230210                 pvl-v2             pvl-v2+opt

Hmean     faults/cpu-1    905497.4354 (   0.00%)   888736.5570 *  -1.85%*   888695.2675 *  -1.86%*
Hmean     faults/cpu-4    758519.2719 (   0.00%)   812103.1991 *   7.06%*   825077.9277 *   8.77%*
Hmean     faults/cpu-7    617153.8038 (   0.00%)   729943.4518 *  18.28%*   770872.3161 *  24.91%*
Hmean     faults/cpu-12   424848.5266 (   0.00%)   550357.2856 *  29.54%*   597478.5634 *  40.63%*
Hmean     faults/cpu-21   290142.9988 (   0.00%)   383668.3190 *  32.23%*   433376.8959 *  49.37%*
Hmean     faults/cpu-30   218705.2915 (   0.00%)   299888.5533 *  37.12%*   342640.6153 *  56.67%*
Hmean     faults/cpu-48   142842.3372 (   0.00%)   206498.2605 *  44.56%*   240306.3442 *  68.23%*
Hmean     faults/cpu-79    90706.1425 (   0.00%)   160006.6800 *  76.40%*   185298.4326 * 104.28%*
Hmean     faults/cpu-110   67011.9297 (   0.00%)   143536.0062 * 114.19%*   162688.8015 * 142.78%*
Hmean     faults/cpu-128   55986.4986 (   0.00%)   136550.8760 * 143.90%*   152718.8713 * 172.78%*

Hmean     faults/sec-1    905492.1265 (   0.00%)   887244.6592 *  -2.02%*   887775.6079 *  -1.96%*
Hmean     faults/sec-4   2994284.4204 (   0.00%)  3154236.9408 *   5.34%*  3221994.8465 *   7.60%*
Hmean     faults/sec-7   4177411.3461 (   0.00%)  4933286.4045 *  18.09%*  5202347.2077 *  24.54%*
Hmean     faults/sec-12  4892848.3633 (   0.00%)  6054577.0988 *  23.74%*  6511987.1142 *  33.09%*
Hmean     faults/sec-21  5823534.1820 (   0.00%)  7637637.4162 *  31.15%*  8553362.3513 *  46.88%*
Hmean     faults/sec-30  6247210.8414 (   0.00%)  8598150.6717 *  37.63%*  9799696.0945 *  56.87%*
Hmean     faults/sec-48  6274617.1419 (   0.00%)  9467132.3699 *  50.88%* 11049401.9072 *  76.10%*
Hmean     faults/sec-79  6187291.4971 (   0.00%) 11919062.5284 *  92.64%* 13420825.3820 * 116.91%*
Hmean     faults/sec-110 6454542.3239 (   0.00%) 15050228.1869 * 133.17%* 16667873.7618 * 158.23%*
Hmean     faults/sec-128 6472970.8548 (   0.00%) 16647275.6575 * 157.18%* 18680029.3714 * 188.59%*

As expected, the tests highlight the improved scalability as the core
count increases.

> [0] https://github.com/gormanm/mmtests
> [1] https://github.com/gormanm/mmtests/blob/master/compare-kernels.sh

[2] https://github.com/gormanm/pft/pull/1/commits/8fe554a3d8b4f5947cd00d4b46f97178b8ba8752
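The modification Punit describes boils down to moving the mmap() into each
worker thread so that every thread faults in its own VMA. The function below
is a toy drop-in replacement for toucher() in the earlier sketch (reusing its
region_sz, page_sz, and NTHREADS globals); it is an illustration of the idea,
not the code in the linked pft commit.

/* Variant: each thread creates its OWN anonymous mapping, so faults
 * from different threads hit different VMAs and their per-VMA read
 * locks no longer contend with each other.  (In practice, guard gaps
 * may be needed to keep the kernel from merging adjacent anonymous
 * VMAs back into one.) */
static void *toucher_own_vma(void *arg)
{
	size_t slice = region_sz / NTHREADS;
	char *p = mmap(NULL, slice, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	for (size_t off = 0; off < slice; off += page_sz)
		p[off] = 1;      /* fault in this thread's private VMA */
	munmap(p, slice);
	return NULL;
}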
On Tue, Feb 28, 2023 at 4:06 AM Punit Agrawal <punit.agrawal@bytedance.com> wrote:
>
> Punit Agrawal <punit.agrawal@bytedance.com> writes:
>
> > [...]
>
> The above workload represents the worst case with regard to per-VMA
> locks as it creates a single large VMA. As a follow-up, I modified
> pft[2] to create a VMA per thread to understand the behaviour in
> scenarios where per-VMA locks should show the most benefit.
>
> [... benchmark results snipped ...]
>
> As expected, the tests highlight the improved scalability as the core
> count increases.

Thanks for trying this, Punit! This is very encouraging.

> [2] https://github.com/gormanm/pft/pull/1/commits/8fe554a3d8b4f5947cd00d4b46f97178b8ba8752