Message ID | cover.1685506205.git.raghavendra.kt@amd.com |
---|---|
Series | sched/numa: Fix disjoint set vma scan regression |
Hello Raghavendra,

On 5/31/2023 9:55 AM, Raghavendra K T wrote:
> With the numa scan enhancements [1], only the threads which had previously
> accessed a vma are allowed to scan it.
>
> While this significantly reduced system time overhead, there were corner
> cases which genuinely need some relaxation, e.g. the concern raised by
> PeterZ about unfairness among threads belonging to disjoint sets of vmas,
> which can amplify the side effect of vma regions belonging to some of the
> tasks being left unscanned.
>
> [1] handled that issue by allowing the first two scans at the mm level
> (mm->numa_scan_seq) unconditionally. But that was not enough.
>
> One of the tests that exercises this side effect is numa01_THREAD_ALLOC,
> where allocation happens in the main thread and the memory is divided into
> 24MB chunks that are continuously bzeroed (by 128 threads on my machine).
>
> This was found in an internal LKP run and also reported by [4].
>
> While RFC V1 [2] tried to address this issue, the logic had more
> heuristics. RFC V2 [3] was rewritten based on vma_size.
>
> The current implementation drops some of the additional logic for long
> running tasks and revisits some of the READ_ONCE()/WRITE_ONCE() usage.
>
> The current patch addresses the same issue in a more accurate way as
> follows:
>
> (1) A task that tries to scan a disjoint vma (one it is not associated
> with, i.e. has not accessed) is now allowed to induce prot_none faults.
> The total number of such unconditional scans allowed per vma is derived
> from the exact vma size as follows:
>
>     total scans allowed = 1/2 * vma_size / scan_size
>
> (2) The total number of scans already done is maintained in a per-vma
> scan counter.
>
> With the above patch, the reported numa01_THREAD_ALLOC regression is
> resolved, but please note that [1] drastically decreased system time for
> mmtest numa01, and this patch adds back some of that system time.
>
> Summary: the numa scan enhancement patch [1] together with the current
> patchset improves overall system time by filtering unnecessary numa
> scanning while still retaining necessary scanning in some corner cases
> that involve disjoint sets of vmas.
>
> Your comments/ideas are welcome.
>
> Changes since:
> RFC V2:
> 1) Drop reset of scan counter that tried to take care of long running
>    workloads
> 2) Correct usage of READ_ONCE/WRITE_ONCE (Bharata)
> 3) Base is 6.4.0-rc2
>
> RFC V1:
> 1) Rewrite entire logic based on actual vma size rather than heuristics
> 2) Added Reported-by for kernel test robot and internal LKP test
> 3) Rebased to 6.4.0-rc1 (ba0ad6ed89)
>
> Result:
> SUT: Milan w/ 2 numa nodes, 256 cpus
>
> Run of numa01_THREAD_ALLOC on 6.4.0-rc2 (which has the numascan enhancement)
>                          base-numascan    base           base+fix
> real                     1m1.507s         1m23.259s      1m2.632s
> user                     213m51.336s      251m46.363s    220m35.528s
> sys                      3m3.397s         0m12.492s      2m41.393s
>
> numa_hit                 5615517          4560123        4963875
> numa_local               5615505          4560024        4963700
> numa_other               12               99             175
> numa_pte_updates         1822797          493            1559111
> numa_hint_faults         1307113          523            1469031
> numa_hint_faults_local   612617           488            884829
> numa_pages_migrated      694370           35             584202
>
> We can see the regression in base real time is recovered, but with some
> additional system time overhead.
>
> Below is the mmtest autonuma performance:
>
> autonumabench
> =============
> (base 6.4.0-rc2 that has the numascan enhancement)
>                                base-numascan           base                  base+fix
> Amean syst-NUMA01              300.46 (  0.00%)      23.97 *  92.02%*      67.18 *  77.64%*
> Amean syst-NUMA01_THREADLOCAL    0.20 (  0.00%)       0.22 *  -9.15%*       0.22 *  -9.15%*
> Amean syst-NUMA02                0.70 (  0.00%)       0.71 *  -0.61%*       0.70 *   0.41%*
> Amean syst-NUMA02_SMT            0.58 (  0.00%)       0.62 *  -5.38%*       0.61 *  -3.67%*
> Amean elsp-NUMA01              320.92 (  0.00%)     276.13 *  13.96%*     324.11 *  -0.99%*
> Amean elsp-NUMA01_THREADLOCAL    1.02 (  0.00%)       1.03 *  -1.83%*       1.03 *  -1.83%*
> Amean elsp-NUMA02                3.16 (  0.00%)       3.93 * -24.20%*       3.14 *   0.81%*
> Amean elsp-NUMA02_SMT            3.82 (  0.00%)       3.87 *  -1.27%*       3.44 *   9.90%*
>
> Duration User       403532.43    279173.53    359098.23
> Duration System       2114.31       179.20       481.54
> Duration Elapsed      2312.20      2004.48      2335.84
>
> Ops NUMA alloc hit                 55795455.00    45452739.00    45500387.00
> Ops NUMA alloc local               55794177.00    45435858.00    45500070.00
> Ops NUMA base-page range updates  147858285.00       18601.00    42043107.00
> Ops NUMA PTE updates              147858285.00       18601.00    42043107.00
> Ops NUMA hint faults              150531983.00       18254.00    42450080.00
> Ops NUMA hint local faults %      125691825.00       11964.00    32993313.00
> Ops NUMA hint local percent              83.50          65.54          77.72
> Ops NUMA pages migrated            13535786.00        2207.00     4654628.00
> Ops AutoNUMA cost                    753952.10          91.44      212633.14
>
> Please note there is a system time overhead added for numa01, but we still
> have a very good improvement w.r.t. the base without numascan.
>

I tested the patch with the lkp autonuma benchmark on a dual socket 4th
Generation EPYC server (2 x 96C/192T) running in NPS1 mode. Below are the
results:

commit:
  6.4.0-rc2
  6.4.0-rc2+patch

       6.4.0-rc2       6.4.0-rc2+patch
  ----------------  ---------------------------
       %stddev          %change     %stddev
           \               |            \
         501.84         -12.5%       439.14    numa01.seconds
         228.66          -1.8%       224.44    numa01_THREAD_ALLOC.seconds
           0.51         +21.6%         0.62    numa02.seconds
         107.17          +0.0%       107.17    numa02_SMT.seconds
           2936          -9.1%         2669    elapsed_time
         794910          +3.7%       824178    system_time
         474520         -17.5%       391331    user_time

Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>

> [1] Link: https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@amd.com/T/#t
> [2] Link: https://lore.kernel.org/lkml/cover.1683033105.git.raghavendra.kt@amd.com/
> [3] Link: https://lore.kernel.org/lkml/cover.1684228065.git.raghavendra.kt@amd.com/T/
> [4] Link: https://lore.kernel.org/lkml/db995c11-08ba-9abf-812f-01407f70a5d4@amd.com/T/
>
> Raghavendra K T (1):
>   sched/numa: Fix disjoint set vma scan regression
>
>  include/linux/mm_types.h |  1 +
>  kernel/sched/fair.c      | 31 ++++++++++++++++++++++++-------
>  2 files changed, 25 insertions(+), 7 deletions(-)
>

--
Thanks and regards,
Swapnil
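For readers following the mechanism in points (1) and (2) of the cover letter
above, a minimal standalone sketch of the per-vma scan cap is shown below.
This is an illustration, not the actual patch: the struct, field, and function
names are invented for this example, and the 256MB scan size is an assumed
default. The real change adds state via include/linux/mm_types.h and hooks
into the NUMA scanning path in kernel/sched/fair.c.

#include <stdbool.h>
#include <stdio.h>

#define SCAN_SIZE_MB 256UL  /* assumed default NUMA balancing scan size */

struct vma_sketch {                 /* hypothetical stand-in for a vma */
        unsigned long size_mb;      /* vma size in MB */
        unsigned long scans_done;   /* (2) per-vma scan counter */
};

/* (1) total scans allowed = 1/2 * vma_size / scan_size */
static unsigned long disjoint_scans_allowed(const struct vma_sketch *vma)
{
        return (vma->size_mb / SCAN_SIZE_MB) / 2;
}

/* Should a scan by a task that never accessed this vma still proceed? */
static bool allow_disjoint_scan(struct vma_sketch *vma)
{
        if (vma->scans_done >= disjoint_scans_allowed(vma))
                return false;   /* cap reached: fall back to the access filter */
        vma->scans_done++;      /* account this unconditional scan */
        return true;
}

int main(void)
{
        /* e.g. a 3072MB vma: 3072/256 = 12 scan windows, so 6 free scans */
        struct vma_sketch vma = { .size_mb = 3072, .scans_done = 0 };
        unsigned long granted = 0;

        for (int i = 0; i < 12; i++)
                granted += allow_disjoint_scan(&vma);

        printf("unconditional scans granted: %lu\n", granted); /* prints 6 */
        return 0;
}

Note that the sketch is single-threaded; in the kernel the per-vma counter can
be touched by multiple threads, which is why the changelog calls out the
READ_ONCE()/WRITE_ONCE() usage.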
On 6/7/2023 5:10 PM, Sapkal Swapnil wrote:
> Hello Raghavendra,
>
> [... quoted cover letter and numa01_THREAD_ALLOC results snipped ...]
> [... quoted mmtest autonuma results snipped ...]
>
> I tested the patch with the lkp autonuma benchmark on a dual socket 4th
> Generation EPYC server (2 x 96C/192T) running in NPS1 mode.
>
> [... results table snipped ...]
>
> Tested-by: Swapnil Sapkal <swapnil.sapkal@amd.com>
>
> --
> Thanks and regards,
> Swapnil

Thank you, Swapnil. A reminder again that LKP's numa01 = numa01_THREAD_ALLOC,
which has regained its numbers.

I will also wait to see whether the kernel test robot confirms the issue is
fixed, and whether Mel/Peter have any objections/comments on the direction.

Regards