| Message ID | b0a8f3490b491d4fd003c3e0493e940afaea5f2c.1684228065.git.raghavendra.kt@amd.com |
|---|---|
| State | New |
| Series | sched/numa: Fix disjoint set vma scan regression |
On 16-May-23 2:49 PM, Raghavendra K T wrote:
> With the numa scan enhancements [1], only the threads which had previously
> accessed the vma are allowed to scan.
>
> While this significantly reduced system time overhead, there were corner
> cases which genuinely need some relaxation. For example:
>
> 1) A concern raised by PeterZ: if there are N partitioned (disjoint) sets
> of vmas belonging to tasks, then unfairness in allowing these threads to
> scan could potentially amplify the side effect of some of the vmas being
> left unscanned.
>
> 2) The LKP reports below of a numa01 benchmark regression.
>
> Currently this is handled by allowing the first two scans unconditionally,
> as indicated by mm->numa_scan_seq. This is imprecise, since for some
> benchmarks vma scanning might itself start at numa_scan_seq > 2.
>
> Solution:
> Allow unconditional scanning of the vmas of tasks depending on vma size.
> This is achieved by maintaining a per-vma scan counter, where
>
>  f(allowed_to_scan) = f(scan_counter < vma_size / scan_size)
>
> Fixes: fc137c0ddab2 ("sched/numa: enhance vma scanning logic") regression.
>
> Result:
> numa01_THREAD_ALLOC result on 6.4.0-rc1 (which has the numascan enhancement):
>
>                          base-numascan    base           base+fix
> real                     1m3.025s         1m24.163s      1m3.551s
> user                     213m44.232s      251m3.638s     219m55.662s
> sys                      6m26.598s        0m13.056s      2m35.767s
>
> numa_hit                 5478165          4395752        4907431
> numa_local               5478103          4395366        4907044
> numa_other               62               386            387
> numa_pte_updates         1989274          11606          1265014
> numa_hint_faults         1756059          515            1135804
> numa_hint_faults_local   971500           486            558076
> numa_pages_migrated      784211           29             577728
>
> Summary: The regression in base is recovered by allowing scanning as
> required.
>
> [1] https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@amd.com/T/#t
>
> Reported-by: Aithal Srikanth <sraithal@amd.com>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/lkml/db995c11-08ba-9abf-812f-01407f70a5d4@amd.com/T/
> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
> ---
>  include/linux/mm_types.h |  1 +
>  kernel/sched/fair.c      | 41 ++++++++++++++++++++++++++++++++--------
>  2 files changed, 34 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 306a3d1a0fa6..992e460a713e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -479,6 +479,7 @@ struct vma_numab_state {
>  	unsigned long next_scan;
>  	unsigned long next_pid_reset;
>  	unsigned long access_pids[2];
> +	unsigned int scan_counter;
>  };
>
>  /*
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 373ff5f55884..2c3e17e7fc2f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2931,20 +2931,34 @@ static void reset_ptenuma_scan(struct task_struct *p)
>  static bool vma_is_accessed(struct vm_area_struct *vma)
>  {
>  	unsigned long pids;
> +	unsigned int vma_size;
> +	unsigned int scan_threshold;
> +	unsigned int scan_size;
> +
> +	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
> +
> +	if (test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids))
> +		return true;
> +
> +	scan_size = READ_ONCE(sysctl_numa_balancing_scan_size);
> +	/* vma size in MB */
> +	vma_size = (vma->vm_end - vma->vm_start) >> 20;
> +
> +	/* Total scans needed to cover VMA */
> +	scan_threshold = (vma_size / scan_size);
> +
>  	/*
> -	 * Allow unconditional access first two times, so that all the (pages)
> -	 * of VMAs get prot_none fault introduced irrespective of accesses.
> +	 * Allow the scanning of half of disjoint set's VMA to induce
> +	 * prot_none fault irrespective of accesses.
>  	 * This is also done to avoid any side effect of task scanning
>  	 * amplifying the unfairness of disjoint set of VMAs' access.
>  	 */
> -	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
> -		return true;
> -
> -	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
> -	return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
> +	scan_threshold = 1 + (scan_threshold >> 1);
> +	return (READ_ONCE(vma->numab_state->scan_counter) <= scan_threshold);
>  }
>
> -#define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
> +#define VMA_PID_RESET_PERIOD	(4 * sysctl_numa_balancing_scan_delay)
> +#define DISJOINT_VMA_SCAN_RENEW_THRESH	16
>
>  /*
>   * The expensive part of numa migration is done from task_work context.
> @@ -3058,6 +3072,8 @@ static void task_numa_work(struct callback_head *work)
>  			/* Reset happens after 4 times scan delay of scan start */
>  			vma->numab_state->next_pid_reset = vma->numab_state->next_scan +
>  				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
> +
> +			WRITE_ONCE(vma->numab_state->scan_counter, 0);
>  		}
>
>  		/*
> @@ -3068,6 +3084,13 @@ static void task_numa_work(struct callback_head *work)
>  				vma->numab_state->next_scan))
>  			continue;
>
> +		/*
> +		 * For long running tasks, renew the disjoint vma scanning
> +		 * periodically.
> +		 */
> +		if (mm->numa_scan_seq && !(mm->numa_scan_seq % DISJOINT_VMA_SCAN_RENEW_THRESH))

Don't you need a READ_ONCE() accessor for mm->numa_scan_seq?

Regards,
Bharata.
On 5/19/2023 1:26 PM, Bharata B Rao wrote:
> On 16-May-23 2:49 PM, Raghavendra K T wrote:
>> With the numa scan enhancements [1], only the threads which had previously
[...]
>> -#define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
>> +#define VMA_PID_RESET_PERIOD	(4 * sysctl_numa_balancing_scan_delay)
>> +#define DISJOINT_VMA_SCAN_RENEW_THRESH	16
>>
>>  /*
>>   * The expensive part of numa migration is done from task_work context.
>> @@ -3058,6 +3072,8 @@ static void task_numa_work(struct callback_head *work)
>>  			/* Reset happens after 4 times scan delay of scan start */
>>  			vma->numab_state->next_pid_reset = vma->numab_state->next_scan +
>>  				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
>> +
>> +			WRITE_ONCE(vma->numab_state->scan_counter, 0);
>>  		}
>>
>>  		/*
>> @@ -3068,6 +3084,13 @@ static void task_numa_work(struct callback_head *work)
>>  				vma->numab_state->next_scan))
>>  			continue;
>>
>> +		/*
>> +		 * For long running tasks, renew the disjoint vma scanning
>> +		 * periodically.
>> +		 */
>> +		if (mm->numa_scan_seq && !(mm->numa_scan_seq % DISJOINT_VMA_SCAN_RENEW_THRESH))
>
> Don't you need a READ_ONCE() accessor for mm->numa_scan_seq?

Hello Bharata,

Yes, thanks for pointing that out. I did ensure that in V1, but it was
somehow left out in V2 :(.

On the other hand, I see that vma->numab_state->scan_counter does not need
READ_ONCE()/WRITE_ONCE(), since it is not modified outside this function
(i.e., it is all done after the cmpxchg above).

Also, thinking about it more, the DISJOINT_VMA_SCAN_RENEW_THRESH reset
change itself may need some correction, and doesn't seem to be absolutely
necessary here. (I will post that separately, with more detail, as an
improvement for long-running benchmarks, as per my experiments.)

I will wait a while for confirmation that this patch fixes the reported
regression, and/or for any better idea/ack, and then repost.
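For readers following the thread, a minimal sketch of what the renewal check
could look like with the READ_ONCE() accessor applied — this is only an
illustration of the correction agreed above, not a posted V3 hunk:

	/*
	 * Illustrative sketch only (not a posted patch): snapshot
	 * mm->numa_scan_seq once with READ_ONCE(), since it is updated
	 * concurrently (reset_ptenuma_scan() bumps it with WRITE_ONCE()),
	 * and reuse the snapshot for both tests so a refetched/torn load
	 * cannot make the two tests disagree.
	 */
	int seq = READ_ONCE(mm->numa_scan_seq);

	if (seq && !(seq % DISJOINT_VMA_SCAN_RENEW_THRESH))
		WRITE_ONCE(vma->numab_state->scan_counter, 0);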
Hello,

kernel test robot noticed a -46.3% improvement of autonuma-benchmark.numa01.seconds on:

commit: d281d36ed007eabb243ad2d489c52c43961f8ac3 ("[RFC PATCH V2 1/1] sched/numa: Fix disjoint set vma scan regression")
url: https://github.com/intel-lab-lkp/linux/commits/Raghavendra-K-T/sched-numa-Fix-disjoint-set-vma-scan-regression/20230516-180954
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git a6fcdd8d95f7486150b3faadfea119fc3dfc3b74
patch link: https://lore.kernel.org/all/b0a8f3490b491d4fd003c3e0493e940afaea5f2c.1684228065.git.raghavendra.kt@amd.com/
patch subject: [RFC PATCH V2 1/1] sched/numa: Fix disjoint set vma scan regression

we noticed this patch addressed the performance regression we reported at
https://lore.kernel.org/all/202305101547.20f4c32a-oliver.sang@intel.com/

we also noticed there is still some discussion in the thread
https://lore.kernel.org/all/202305101547.20f4c32a-oliver.sang@intel.com/

since we didn't see a V3 patch, we are sending out this report for your
information about its performance impact.

testcase: autonuma-benchmark
test machine: 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz (Cascade Lake) with 128G memory
parameters:

	iterations: 4x
	test: numa02_SMT
	cpufreq_governor: performance

Details are as below:
-------------------------------------------------------------------------------------------------->

To reproduce:

	git clone https://github.com/intel/lkp-tests.git
	cd lkp-tests
	sudo bin/lkp install job.yaml           # job file is attached in this email
	bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
	sudo bin/lkp run generated-yaml-file

	# if come across any failure that blocks the test,
	# please remove ~/.lkp and /lkp dir to run from a clean state.

=========================================================================================
compiler/cpufreq_governor/iterations/kconfig/rootfs/tbox_group/test/testcase:
  gcc-11/performance/4x/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-csl-2sp9/numa02_SMT/autonuma-benchmark

commit:
  a6fcdd8d95 ("sched/debug: Correct printing for rq->nr_uninterruptible")
  d281d36ed0 ("sched/numa: Fix disjoint set vma scan regression")

a6fcdd8d95f74861  d281d36ed007eabb243ad2d489c
----------------  ---------------------------
       %stddev      %change         %stddev

      1899 ± 2%      -37.4%       1189 ± 18%   uptime.boot
      1809            +3.1%       1866         vmstat.system.cs
 1.685e+10 ± 3%      -52.2%  8.052e+09 ± 11%   cpuidle..time
  17400470 ± 3%      -52.3%    8308590 ± 11%   cpuidle..usage
     26350 ± 7%       -9.2%      23932         meminfo.Active
     26238 ± 7%       -9.2%      23828         meminfo.Active(anon)
     38666 ± 8%      +21.4%      46957 ± 7%    meminfo.Mapped
      1996           -16.5%       1666         meminfo.Mlocked
     23350 ± 56%     +64.7%      38446 ± 8%    numa-meminfo.node0.Mapped
      5108 ± 6%      +78.7%       9132 ± 43%   numa-meminfo.node0.Shmem
     25075 ± 8%       -9.3%      22750         numa-meminfo.node1.Active
     25057 ± 8%       -9.5%      22681         numa-meminfo.node1.Active(anon)
   2038104 ± 7%      -34.9%    1327290 ± 12%   numa-numastat.node0.local_node
   2394647 ± 5%      -30.9%    1655892 ± 15%   numa-numastat.node0.numa_hit
   1988880 ± 7%      -27.5%    1442918 ± 12%   numa-numastat.node1.local_node
   2255986 ± 6%      -23.0%    1737172 ± 17%   numa-numastat.node1.numa_hit
     10.54 ± 3%       -1.7        8.83 ± 9%    mpstat.cpu.all.idle%
      0.00 ± 74%      +0.1        0.05 ± 13%   mpstat.cpu.all.iowait%
      2.48            -0.9        1.57         mpstat.cpu.all.irq%
      0.08 ± 2%       -0.0        0.06 ± 3%    mpstat.cpu.all.soft%
      1.48            +0.7        2.19 ± 4%    mpstat.cpu.all.sys%
    427.10           -46.3%     229.25 ± 3%    autonuma-benchmark.numa01.seconds
      1819           -43.5%       1027 ± 3%    autonuma-benchmark.time.elapsed_time
      1819           -43.5%       1027 ± 3%    autonuma-benchmark.time.elapsed_time.max
    791068 ± 2%      -43.6%     446212 ± 4%    autonuma-benchmark.time.involuntary_context_switches
   2089497           -16.8%    1737489 ± 2%    autonuma-benchmark.time.minor_page_faults
      7603            +3.2%       7848         autonuma-benchmark.time.percent_of_cpu_this_job_got
    136519           -42.2%      78864 ± 3%    autonuma-benchmark.time.user_time
     22402           +44.8%      32429 ± 4%    autonuma-benchmark.time.voluntary_context_switches
      5919 ± 55%     +64.7%       9747 ± 9%    numa-vmstat.node0.nr_mapped
      1277 ± 6%      +77.6%       2268 ± 43%   numa-vmstat.node0.nr_shmem
   2394430 ± 5%      -30.9%    1655441 ± 15%   numa-vmstat.node0.numa_hit
   2037887 ± 7%      -34.9%    1326839 ± 12%   numa-vmstat.node0.numa_local
      6261 ± 8%       -9.2%       5683         numa-vmstat.node1.nr_active_anon
      6261 ± 8%       -9.2%       5683         numa-vmstat.node1.nr_zone_active_anon
   2255543 ± 6%      -23.0%    1736429 ± 17%   numa-vmstat.node1.numa_hit
   1988436 ± 7%      -27.5%    1442174 ± 12%   numa-vmstat.node1.numa_local
     35815 ± 5%      -23.8%      27284 ± 17%   turbostat.C1
      0.03 ± 17%      +0.0        0.07 ± 12%   turbostat.C1E%
  17197885 ± 3%      -52.7%    8127065 ± 11%   turbostat.C6
     10.48 ± 3%       -1.7        8.80 ± 10%   turbostat.C6%
     10.23 ± 3%      -17.3%       8.46 ± 10%   turbostat.CPU%c1
      0.24 ± 7%      +61.3%       0.38 ± 11%   turbostat.CPU%c6
 1.615e+08           -42.4%   93035289 ± 3%    turbostat.IRQ
     48830 ± 13%     -35.2%      31632 ± 11%   turbostat.POLL
      0.19 ± 7%      +61.1%       0.30 ± 11%   turbostat.Pkg%pc2
    238.01            +5.2%     250.27         turbostat.PkgWatt
     22.38           +25.3%      28.03         turbostat.RAMWatt
      6557 ± 7%       -9.1%       5963         proc-vmstat.nr_active_anon
   1539398            -4.8%    1465253         proc-vmstat.nr_anon_pages
      2955            -5.8%       2785         proc-vmstat.nr_anon_transparent_hugepages
   1541555            -4.7%    1468824         proc-vmstat.nr_inactive_anon
      9843 ± 8%      +21.4%      11949 ± 7%    proc-vmstat.nr_mapped
    499.00           -16.5%     416.67         proc-vmstat.nr_mlock
      3896            -3.2%       3770         proc-vmstat.nr_page_table_pages
      6557 ± 7%       -9.1%       5963         proc-vmstat.nr_zone_active_anon
   1541555            -4.7%    1468824         proc-vmstat.nr_zone_inactive_anon
     30446 ± 15%    +397.7%     151532 ± 4%    proc-vmstat.numa_hint_faults
     21562 ± 12%    +312.9%      89028 ± 3%    proc-vmstat.numa_hint_faults_local
   4651965           -27.0%    3395711 ± 2%    proc-vmstat.numa_hit
      5122 ± 7%    +1393.9%      76529 ± 5%    proc-vmstat.numa_huge_pte_updates
   4028316           -31.2%    2772852 ± 2%    proc-vmstat.numa_local
   1049660          +672.3%    8106150 ± 6%    proc-vmstat.numa_pages_migrated
   2725369 ± 7%    +1343.9%   39352403 ± 5%    proc-vmstat.numa_pte_updates
     45132 ± 31%     +31.9%      59519         proc-vmstat.pgactivate
 1.816e+08 ± 2%      +5.6%   1.918e+08 ± 3%    proc-vmstat.pgalloc_normal
   5863913           -30.2%    4092045 ± 2%    proc-vmstat.pgfault
 1.815e+08 ± 2%      +5.7%   1.918e+08 ± 3%    proc-vmstat.pgfree
   1049660          +672.3%    8106150 ± 6%    proc-vmstat.pgmigrate_success
    264923           -35.1%     171993 ± 2%    proc-vmstat.pgreuse
      2037          +675.1%      15790 ± 6%    proc-vmstat.thp_migration_success
  13598464           -42.9%    7770880 ± 3%    proc-vmstat.unevictable_pgs_scanned
      3208 ± 14%     +44.1%       4624 ± 12%   sched_debug.cfs_rq:/.load.min
      2.73 ± 16%     +52.6%       4.16 ± 14%   sched_debug.cfs_rq:/.load_avg.min
  94294753 ± 2%      -45.7%   51173318 ± 3%    sched_debug.cfs_rq:/.min_vruntime.avg
  98586361 ± 2%      -46.3%   52983552 ± 3%    sched_debug.cfs_rq:/.min_vruntime.max
  85615972 ± 2%      -45.4%   46737672 ± 3%    sched_debug.cfs_rq:/.min_vruntime.min
   2806211 ± 7%      -53.1%    1314959 ± 6%    sched_debug.cfs_rq:/.min_vruntime.stddev
      2.63 ± 23%     +65.8%       4.36 ± 24%   sched_debug.cfs_rq:/.removed.load_avg.avg
      1.08 ± 23%     +55.4%       1.68 ± 19%   sched_debug.cfs_rq:/.removed.runnable_avg.avg
      1.07 ± 24%     +56.7%       1.68 ± 19%   sched_debug.cfs_rq:/.removed.util_avg.avg
   7252565 ± 13%     -46.6%    3874343 ± 15%   sched_debug.cfs_rq:/.spread0.avg
  11534195 ± 10%     -50.8%    5673455 ± 12%   sched_debug.cfs_rq:/.spread0.max
  -1406320           -61.2%    -546147         sched_debug.cfs_rq:/.spread0.min
   2795186 ± 7%      -53.2%    1309202 ± 6%    sched_debug.cfs_rq:/.spread0.stddev
      6.57 ± 40%   +6868.7%     457.74 ± 6%    sched_debug.cfs_rq:/.util_est_enqueued.avg
    275.73 ± 43%    +333.0%       1193 ± 5%    sched_debug.cfs_rq:/.util_est_enqueued.max
     37.70 ± 41%    +722.1%     309.90 ± 3%    sched_debug.cfs_rq:/.util_est_enqueued.stddev
    794516 ± 5%      -26.2%     586654 ± 17%   sched_debug.cpu.avg_idle.min
    224.87 ± 4%      -42.6%     129.01 ± 10%   sched_debug.cpu.clock.stddev
    885581 ± 2%      -42.6%     508722 ± 3%    sched_debug.cpu.clock_task.min
     19466 ± 2%      -30.4%      13558 ± 4%    sched_debug.cpu.curr->pid.avg
     26029 ± 2%      -33.9%      17216 ± 2%    sched_debug.cpu.curr->pid.max
     13168 ± 11%     -33.3%       8788 ± 12%   sched_debug.cpu.curr->pid.min
      2761 ± 14%     -37.3%       1730 ± 29%   sched_debug.cpu.curr->pid.stddev
    958735           -24.7%     721786 ± 11%   sched_debug.cpu.max_idle_balance_cost.max
     97977 ± 3%      -52.2%      46860 ± 25%   sched_debug.cpu.max_idle_balance_cost.stddev
      0.00 ± 4%      -41.9%       0.00 ± 10%   sched_debug.cpu.next_balance.stddev
     20557 ± 3%      -37.3%      12889 ± 3%    sched_debug.cpu.nr_switches.avg
     81932 ± 7%      -25.4%      61097 ± 15%   sched_debug.cpu.nr_switches.max
      6643 ± 6%      -39.2%       4037 ± 15%   sched_debug.cpu.nr_switches.min
     13929 ± 6%      -30.1%       9740 ± 4%    sched_debug.cpu.nr_switches.stddev
     20.30 ± 23%     +73.3%      35.19 ± 28%   sched_debug.cpu.nr_uninterruptible.max
    -12.43          +145.8%     -30.55         sched_debug.cpu.nr_uninterruptible.min
      5.28 ± 12%     +82.4%       9.63 ± 13%   sched_debug.cpu.nr_uninterruptible.stddev
    925729 ± 2%      -42.4%     533314 ± 3%    sched_debug.sched_clk
     36.08           +51.2%      54.54 ± 2%    perf-stat.i.MPKI
 1.037e+08            +9.7%  1.137e+08         perf-stat.i.branch-instructions
      1.36            +0.0        1.39         perf-stat.i.branch-miss-rate%
   1602349           +19.8%    1918946 ± 3%    perf-stat.i.branch-misses
  11889864           +56.5%   18603954 ± 2%    perf-stat.i.cache-misses
  17973544           +54.5%   27773059 ± 2%    perf-stat.i.cache-references
      1771            +3.1%       1826         perf-stat.i.context-switches
 2.147e+11            +3.2%  2.215e+11         perf-stat.i.cpu-cycles
    112.60           +16.6%     131.26         perf-stat.i.cpu-migrations
     18460           -33.1%      12347         perf-stat.i.cycles-between-cache-misses
      0.03 ± 4%       +0.0        0.04 ± 8%    perf-stat.i.dTLB-load-miss-rate%
     52266 ± 4%      +28.9%      67377 ± 7%    perf-stat.i.dTLB-load-misses
 1.442e+08            +7.7%  1.553e+08         perf-stat.i.dTLB-loads
      0.25            +0.0        0.27         perf-stat.i.dTLB-store-miss-rate%
    189901           +13.6%     215670         perf-stat.i.dTLB-store-misses
  80186719            +6.8%   85653052         perf-stat.i.dTLB-stores
    400061 ± 4%      +18.9%     475847 ± 4%    perf-stat.i.iTLB-load-misses
    361622 ± 2%      -18.0%     296709 ± 12%   perf-stat.i.iTLB-loads
 5.358e+08            +9.1%  5.845e+08         perf-stat.i.instructions
      1420 ± 2%       -4.7%       1353 ± 3%    perf-stat.i.instructions-per-iTLB-miss
      0.00 ± 5%      +24.9%       0.01 ± 8%    perf-stat.i.ipc
      0.06 ± 8%      -21.0%       0.04 ± 8%    perf-stat.i.major-faults
      2.44            +3.3%       2.52         perf-stat.i.metric.GHz
      1592            +3.9%       1655 ± 2%    perf-stat.i.metric.K/sec
      2.46           +17.0%       2.88         perf-stat.i.metric.M/sec
      3139           +21.5%       3815         perf-stat.i.minor-faults
     52.88            +1.2       54.08         perf-stat.i.node-load-miss-rate%
    238363           +53.4%     365608         perf-stat.i.node-load-misses
    219288 ± 4%      +35.5%     297188         perf-stat.i.node-loads
     50.27            -6.3       44.01 ± 4%    perf-stat.i.node-store-miss-rate%
   5122757           +33.6%    6845784 ± 3%    perf-stat.i.node-store-misses
   5214111           +77.3%    9242761 ± 5%    perf-stat.i.node-stores
      3139           +21.5%       3815         perf-stat.i.page-faults
     33.59           +41.4%      47.50 ± 2%    perf-stat.overall.MPKI
      1.54            +0.2        1.70 ± 2%    perf-stat.overall.branch-miss-rate%
    407.04            -6.3%     381.24         perf-stat.overall.cpi
     18139           -34.2%      11935 ± 2%    perf-stat.overall.cycles-between-cache-misses
      0.03 ± 4%       +0.0        0.04 ± 7%    perf-stat.overall.dTLB-load-miss-rate%
      0.24            +0.0        0.25         perf-stat.overall.dTLB-store-miss-rate%
     54.20 ± 2%       +8.4       62.57 ± 5%    perf-stat.overall.iTLB-load-miss-rate%
      1353 ± 3%       -8.7%       1236 ± 5%    perf-stat.overall.instructions-per-iTLB-miss
      0.00            +6.8%       0.00         perf-stat.overall.ipc
     51.26 ± 2%       +2.9       54.20         perf-stat.overall.node-load-miss-rate%
     49.80            -7.1       42.74 ± 4%    perf-stat.overall.node-store-miss-rate%
 1.031e+08           +10.1%  1.136e+08         perf-stat.ps.branch-instructions
   1590263           +21.1%    1925935 ± 3%    perf-stat.ps.branch-misses
  11959851           +55.9%   18644201 ± 2%    perf-stat.ps.cache-misses
  17905275           +54.8%   27718539 ± 2%    perf-stat.ps.cache-references
      1778            +2.7%       1826         perf-stat.ps.context-switches
 2.169e+11            +2.5%  2.224e+11         perf-stat.ps.cpu-cycles
    112.21           +16.9%     131.18         perf-stat.ps.cpu-migrations
     50162 ± 4%      +30.8%      65621 ± 8%    perf-stat.ps.dTLB-load-misses
 1.434e+08            +8.0%  1.549e+08         perf-stat.ps.dTLB-loads
    191061           +13.1%     216009         perf-stat.ps.dTLB-store-misses
  79737750            +7.0%   85349028         perf-stat.ps.dTLB-stores
    394455 ± 4%      +20.0%     473256 ± 4%    perf-stat.ps.iTLB-load-misses
  5.33e+08            +9.5%  5.835e+08         perf-stat.ps.instructions
      0.06 ± 9%      -21.7%       0.04 ± 8%    perf-stat.ps.major-faults
      3088           +22.3%       3775         perf-stat.ps.minor-faults
    236351           +54.5%     365167         perf-stat.ps.node-load-misses
    225011 ± 4%      +37.1%     308593         perf-stat.ps.node-loads
   5183662           +32.7%    6879838 ± 3%    perf-stat.ps.node-store-misses
   5224274           +76.7%    9231733 ± 5%    perf-stat.ps.node-stores
      3088           +22.3%       3775         perf-stat.ps.page-faults
 9.703e+11           -38.2%  5.999e+11 ± 2%    perf-stat.total.instructions
     14.84 ± 13%     -12.7        2.15 ± 14%   perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt
      1.37 ± 11%      -0.6        0.76 ± 31%   perf-profile.calltrace.cycles-pp.evsel__read_counter.read_counters.process_interval.dispatch_events.cmd_stat
      0.53 ± 72%      +0.5        1.03 ± 18%   perf-profile.calltrace.cycles-pp.do_anonymous_page.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault
      1.26 ± 18%      +0.8        2.11 ± 12%   perf-profile.calltrace.cycles-pp.serial8250_console_write.console_flush_all.console_unlock.vprintk_emit._printk
      1.26 ± 18%      +0.9        2.13 ± 12%   perf-profile.calltrace.cycles-pp.irq_work_run_list.irq_work_run.__sysvec_irq_work.sysvec_irq_work.asm_sysvec_irq_work
      1.26 ± 18%      +0.9        2.13 ± 12%   perf-profile.calltrace.cycles-pp.irq_work_single.irq_work_run_list.irq_work_run.__sysvec_irq_work.sysvec_irq_work
      1.26 ± 18%      +0.9        2.13 ± 12%   perf-profile.calltrace.cycles-pp._printk.irq_work_single.irq_work_run_list.irq_work_run.__sysvec_irq_work
      1.26 ± 18%      +0.9        2.13 ± 12%   perf-profile.calltrace.cycles-pp.vprintk_emit._printk.irq_work_single.irq_work_run_list.irq_work_run
      1.26 ± 18%      +0.9        2.13 ± 12%   perf-profile.calltrace.cycles-pp.console_unlock.vprintk_emit._printk.irq_work_single.irq_work_run_list
      1.26 ± 18%      +0.9        2.13 ± 12%   perf-profile.calltrace.cycles-pp.console_flush_all.console_unlock.vprintk_emit._printk.irq_work_single
      2.10 ±112%      +6.0        8.06 ± 77%   perf-profile.calltrace.cycles-pp.__libc_start_main
      2.10 ±112%      +6.0        8.06 ± 77%   perf-profile.calltrace.cycles-pp.main.__libc_start_main
      2.10 ±112%      +6.0        8.06 ± 77%   perf-profile.calltrace.cycles-pp.run_builtin.main.__libc_start_main
      1.37 ±105%      +6.0        7.34 ± 77%   perf-profile.calltrace.cycles-pp.record__pushfn.perf_mmap__push.record__mmap_read_evlist.__cmd_record.cmd_record
      4.80 ± 11%      +9.9       14.67 ± 51%   perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
      3.42 ± 17%     +10.6       13.98 ± 55%   perf-profile.calltrace.cycles-pp.handle_mm_fault.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
      3.82 ± 17%     +10.6       14.41 ± 54%   perf-profile.calltrace.cycles-pp.exc_page_fault.asm_exc_page_fault
      4.01 ± 17%     +10.6       14.60 ± 53%   perf-profile.calltrace.cycles-pp.asm_exc_page_fault
      3.78 ± 17%     +10.6       14.40 ± 54%   perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault
     17.34 ± 10%     -13.4        3.91 ± 9%    perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
      1.54 ± 94%      -1.3        0.20 ± 52%   perf-profile.children.cycles-pp.zero_user_segments
      4.02 ± 15%      -1.3        2.75 ± 16%   perf-profile.children.cycles-pp.exit_to_user_mode_loop
      3.52 ± 22%      -1.1        2.45 ± 13%   perf-profile.children.cycles-pp.get_perf_callchain
      3.37 ± 6%       -1.0        2.40 ± 12%   perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
      1.16 ± 7%       -1.0        0.20 ± 39%   perf-profile.children.cycles-pp.rcu_gp_kthread
      1.56 ± 15%      -0.8        0.71 ± 23%   perf-profile.children.cycles-pp.__irq_exit_rcu
      2.62 ± 23%      -0.8        1.82 ± 8%    perf-profile.children.cycles-pp.perf_callchain_kernel
      0.92 ± 10%      -0.8        0.16 ± 42%   perf-profile.children.cycles-pp.rcu_gp_fqs_loop
      2.66 ± 17%      -0.7        1.94 ± 18%   perf-profile.children.cycles-pp.perf_trace_sched_stat_runtime
      0.93 ± 11%      -0.7        0.24 ± 31%   perf-profile.children.cycles-pp.schedule_timeout
      1.53 ± 28%      -0.7        0.85 ± 19%   perf-profile.children.cycles-pp.perf_trace_sched_switch
      2.26 ± 24%      -0.6        1.62 ± 8%    perf-profile.children.cycles-pp.unwind_next_frame
      1.38 ± 11%      -0.6        0.76 ± 31%   perf-profile.children.cycles-pp.evsel__read_counter
      1.89 ± 15%      -0.6        1.29 ± 16%   perf-profile.children.cycles-pp.__do_softirq
      0.60 ± 23%      -0.5        0.06 ± 74%   perf-profile.children.cycles-pp.rebalance_domains
      0.69 ± 20%      -0.4        0.24 ± 65%   perf-profile.children.cycles-pp.load_balance
      1.30 ± 8%       -0.4        0.88 ± 20%   perf-profile.children.cycles-pp.readn
      0.96 ± 16%      -0.4        0.56 ± 14%   perf-profile.children.cycles-pp.task_mm_cid_work
      0.44 ± 22%      -0.3        0.10 ± 9%    perf-profile.children.cycles-pp.__evlist__disable
      0.75 ± 8%       -0.3        0.47 ± 24%   perf-profile.children.cycles-pp.perf_read
      0.72 ± 14%      -0.2        0.47 ± 40%   perf-profile.children.cycles-pp.pick_next_task_fair
      0.63 ± 23%      -0.2        0.38 ± 33%   perf-profile.children.cycles-pp.put_prev_entity
      0.52 ± 18%      -0.2        0.28 ± 32%   perf-profile.children.cycles-pp.asm_sysvec_reschedule_ipi
      0.26 ± 16%      -0.2        0.07 ± 48%   perf-profile.children.cycles-pp.swake_up_one
      0.26 ± 22%      -0.2        0.08 ± 11%   perf-profile.children.cycles-pp.rcu_report_qs_rdp
      0.41 ± 26%      -0.2        0.24 ± 30%   perf-profile.children.cycles-pp.__fdget_pos
      0.19 ± 16%      -0.2        0.03 ±100%   perf-profile.children.cycles-pp.detach_tasks
      0.19 ± 52%      -0.2        0.03 ±100%   perf-profile.children.cycles-pp.ioctl
      0.30 ± 20%      -0.1        0.15 ± 49%   perf-profile.children.cycles-pp.evlist__id2evsel
      0.22 ± 16%      -0.1        0.13 ± 57%   perf-profile.children.cycles-pp.__folio_throttle_swaprate
      0.22 ± 19%      -0.1        0.13 ± 57%   perf-profile.children.cycles-pp.blk_cgroup_congested
      0.17 ± 27%      -0.1        0.08 ± 22%   perf-profile.children.cycles-pp.generic_exec_single
      0.18 ± 24%      -0.1        0.08 ± 22%   perf-profile.children.cycles-pp.smp_call_function_single
      0.17 ± 21%      -0.1        0.10 ± 32%   perf-profile.children.cycles-pp.__perf_read_group_add
      0.13 ± 35%      -0.1        0.08 ± 66%   perf-profile.children.cycles-pp.__kmalloc
      0.10 ± 31%      -0.0        0.05 ± 74%   perf-profile.children.cycles-pp.__perf_event_read
      0.02 ±144%      +0.1        0.10 ± 28%   perf-profile.children.cycles-pp.mntput_no_expire
      0.29 ± 19%      +0.1        0.40 ± 19%   perf-profile.children.cycles-pp.dput
      0.05 ±101%      +0.1        0.18 ± 40%   perf-profile.children.cycles-pp.free_unref_page_prepare
      0.04 ±152%      +0.2        0.20 ± 22%   perf-profile.children.cycles-pp.devkmsg_read
      0.86 ± 8%       +0.2        1.08 ± 19%   perf-profile.children.cycles-pp.step_into
      0.35 ± 27%      +0.3        0.61 ± 15%   perf-profile.children.cycles-pp.run_ksoftirqd
      1.46 ± 18%      +0.9        2.34 ± 22%   perf-profile.children.cycles-pp.wait_for_lsr
      1.57 ± 14%      +1.0        2.61 ± 18%   perf-profile.children.cycles-pp.serial8250_console_write
      1.58 ± 14%      +1.0        2.63 ± 19%   perf-profile.children.cycles-pp.console_unlock
      1.58 ± 14%      +1.0        2.63 ± 19%   perf-profile.children.cycles-pp.console_flush_all
      1.58 ± 14%      +1.1        2.63 ± 19%   perf-profile.children.cycles-pp.asm_sysvec_irq_work
      1.58 ± 14%      +1.1        2.63 ± 19%   perf-profile.children.cycles-pp.irq_work_run_list
      1.57 ± 14%      +1.1        2.63 ± 19%   perf-profile.children.cycles-pp.sysvec_irq_work
      1.57 ± 14%      +1.1        2.63 ± 19%   perf-profile.children.cycles-pp.__sysvec_irq_work
      1.57 ± 14%      +1.1        2.63 ± 19%   perf-profile.children.cycles-pp.irq_work_run
      1.57 ± 14%      +1.1        2.63 ± 19%   perf-profile.children.cycles-pp.irq_work_single
      1.57 ± 14%      +1.1        2.63 ± 19%   perf-profile.children.cycles-pp._printk
      1.57 ± 14%      +1.1        2.63 ± 19%   perf-profile.children.cycles-pp.vprintk_emit
      2.17 ± 21%      +1.3        3.44 ± 28%   perf-profile.children.cycles-pp.io_serial_in
      2.24 ±100%      +5.8        8.06 ± 77%   perf-profile.children.cycles-pp.__libc_start_main
      2.24 ±100%      +5.8        8.06 ± 77%   perf-profile.children.cycles-pp.main
      2.24 ±100%      +5.8        8.06 ± 77%   perf-profile.children.cycles-pp.run_builtin
      1.68 ± 86%      +6.4        8.06 ± 77%   perf-profile.children.cycles-pp.cmd_record
      0.51 ± 59%      +7.9        8.43 ± 94%   perf-profile.children.cycles-pp.copy_page
      0.33 ±109%      +7.9        8.27 ± 97%   perf-profile.children.cycles-pp.folio_copy
      0.36 ±101%      +7.9        8.30 ± 96%   perf-profile.children.cycles-pp.move_to_new_folio
      0.36 ±101%      +7.9        8.30 ± 96%   perf-profile.children.cycles-pp.migrate_folio_extra
      0.42 ± 93%      +8.8        9.21 ± 88%   perf-profile.children.cycles-pp.migrate_pages_batch
      0.44 ± 91%      +8.8        9.24 ± 88%   perf-profile.children.cycles-pp.migrate_misplaced_page
      0.42 ± 93%      +8.8        9.22 ± 88%   perf-profile.children.cycles-pp.migrate_pages
     10.36 ± 5%       +9.5       19.84 ± 36%   perf-profile.children.cycles-pp.asm_exc_page_fault
      9.62 ± 5%       +9.5       19.12 ± 37%   perf-profile.children.cycles-pp.exc_page_fault
      7.83 ± 2%       +9.6       17.45 ± 42%   perf-profile.children.cycles-pp.handle_mm_fault
      9.20 ± 6%       +9.6       18.84 ± 38%   perf-profile.children.cycles-pp.do_user_addr_fault
      6.79 ± 2%       +9.7       16.49 ± 44%   perf-profile.children.cycles-pp.__handle_mm_fault
      0.30 ±117%      +9.8       10.06 ± 79%   perf-profile.children.cycles-pp.do_huge_pmd_numa_page
     12.05 ± 12%     -11.7        0.31 ± 36%   perf-profile.self.cycles-pp.asm_sysvec_apic_timer_interrupt
      0.93 ± 14%      -0.4        0.56 ± 14%   perf-profile.self.cycles-pp.task_mm_cid_work
      0.78 ± 25%      -0.2        0.55 ± 10%   perf-profile.self.cycles-pp.unwind_next_frame
      0.29 ± 19%      -0.1        0.15 ± 50%   perf-profile.self.cycles-pp.evlist__id2evsel
      0.20 ± 24%      -0.1        0.11 ± 49%   perf-profile.self.cycles-pp.exc_page_fault
      0.20 ± 20%      -0.1        0.12 ± 53%   perf-profile.self.cycles-pp.blk_cgroup_congested
      0.17 ± 14%      -0.1        0.12 ± 20%   perf-profile.self.cycles-pp.perf_swevent_event
      0.02 ±144%      +0.1        0.10 ± 28%   perf-profile.self.cycles-pp.mntput_no_expire
      0.12 ± 29%      +0.1        0.23 ± 35%   perf-profile.self.cycles-pp.mod_objcg_state
      0.04 ±104%      +0.1        0.17 ± 36%   perf-profile.self.cycles-pp.free_unref_page_prepare
      1.40 ± 13%      +0.9        2.29 ± 19%   perf-profile.self.cycles-pp.io_serial_in
      0.50 ± 59%      +7.8        8.28 ± 94%   perf-profile.self.cycles-pp.copy_page

Disclaimer:
Results have been estimated based on internal Intel analysis and are
provided for informational purposes only. Any difference in system hardware
or software design or configuration may affect actual performance.
With the numa scan enhancements [1], only the threads which had previously
accessed the vma are allowed to scan.

While this significantly reduced system time overhead, there were corner
cases which genuinely need some relaxation. For example:

1) A concern raised by PeterZ: if there are N partitioned (disjoint) sets
of vmas belonging to tasks, then unfairness in allowing these threads to
scan could potentially amplify the side effect of some of the vmas being
left unscanned.

2) The LKP reports below of a numa01 benchmark regression.

Currently this is handled by allowing the first two scans unconditionally,
as indicated by mm->numa_scan_seq. This is imprecise, since for some
benchmarks vma scanning might itself start at numa_scan_seq > 2.

Solution:
Allow unconditional scanning of the vmas of tasks depending on vma size.
This is achieved by maintaining a per-vma scan counter, where

 f(allowed_to_scan) = f(scan_counter < vma_size / scan_size)

Fixes: fc137c0ddab2 ("sched/numa: enhance vma scanning logic") regression.

Result:
numa01_THREAD_ALLOC result on 6.4.0-rc1 (which has the numascan enhancement):

                         base-numascan    base           base+fix
real                     1m3.025s         1m24.163s      1m3.551s
user                     213m44.232s      251m3.638s     219m55.662s
sys                      6m26.598s        0m13.056s      2m35.767s

numa_hit                 5478165          4395752        4907431
numa_local               5478103          4395366        4907044
numa_other               62               386            387
numa_pte_updates         1989274          11606          1265014
numa_hint_faults         1756059          515            1135804
numa_hint_faults_local   971500           486            558076
numa_pages_migrated      784211           29             577728

Summary: The regression in base is recovered by allowing scanning as
required.

[1] https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@amd.com/T/#t

Reported-by: Aithal Srikanth <sraithal@amd.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/lkml/db995c11-08ba-9abf-812f-01407f70a5d4@amd.com/T/
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm_types.h |  1 +
 kernel/sched/fair.c      | 41 ++++++++++++++++++++++++++++++++--------
 2 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..992e460a713e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -479,6 +479,7 @@ struct vma_numab_state {
 	unsigned long next_scan;
 	unsigned long next_pid_reset;
 	unsigned long access_pids[2];
+	unsigned int scan_counter;
 };
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f55884..2c3e17e7fc2f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2931,20 +2931,34 @@ static void reset_ptenuma_scan(struct task_struct *p)
 static bool vma_is_accessed(struct vm_area_struct *vma)
 {
 	unsigned long pids;
+	unsigned int vma_size;
+	unsigned int scan_threshold;
+	unsigned int scan_size;
+
+	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
+
+	if (test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids))
+		return true;
+
+	scan_size = READ_ONCE(sysctl_numa_balancing_scan_size);
+	/* vma size in MB */
+	vma_size = (vma->vm_end - vma->vm_start) >> 20;
+
+	/* Total scans needed to cover VMA */
+	scan_threshold = (vma_size / scan_size);
+
 	/*
-	 * Allow unconditional access first two times, so that all the (pages)
-	 * of VMAs get prot_none fault introduced irrespective of accesses.
+	 * Allow the scanning of half of disjoint set's VMA to induce
+	 * prot_none fault irrespective of accesses.
 	 * This is also done to avoid any side effect of task scanning
 	 * amplifying the unfairness of disjoint set of VMAs' access.
 	 */
-	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
-		return true;
-
-	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
-	return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
+	scan_threshold = 1 + (scan_threshold >> 1);
+	return (READ_ONCE(vma->numab_state->scan_counter) <= scan_threshold);
 }
 
-#define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
+#define VMA_PID_RESET_PERIOD	(4 * sysctl_numa_balancing_scan_delay)
+#define DISJOINT_VMA_SCAN_RENEW_THRESH	16
 
 /*
  * The expensive part of numa migration is done from task_work context.
@@ -3058,6 +3072,8 @@ static void task_numa_work(struct callback_head *work)
 			/* Reset happens after 4 times scan delay of scan start */
 			vma->numab_state->next_pid_reset = vma->numab_state->next_scan +
 				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+
+			WRITE_ONCE(vma->numab_state->scan_counter, 0);
 		}
 
 		/*
@@ -3068,6 +3084,13 @@ static void task_numa_work(struct callback_head *work)
 				vma->numab_state->next_scan))
 			continue;
 
+		/*
+		 * For long running tasks, renew the disjoint vma scanning
+		 * periodically.
+		 */
+		if (mm->numa_scan_seq && !(mm->numa_scan_seq % DISJOINT_VMA_SCAN_RENEW_THRESH))
+			WRITE_ONCE(vma->numab_state->scan_counter, 0);
+
 		/* Do not scan the VMA if task has not accessed */
 		if (!vma_is_accessed(vma))
 			continue;
@@ -3083,6 +3106,8 @@ static void task_numa_work(struct callback_head *work)
 			vma->numab_state->access_pids[0] = READ_ONCE(vma->numab_state->access_pids[1]);
 			vma->numab_state->access_pids[1] = 0;
 		}
+		WRITE_ONCE(vma->numab_state->scan_counter,
+			READ_ONCE(vma->numab_state->scan_counter) + 1);
 
 		do {
 			start = max(start, vma->vm_start);
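To make the threshold arithmetic in vma_is_accessed() concrete, here is a
small standalone userspace sketch of the same computation. This is purely
illustrative, not kernel code; the VMA size is an assumed value, and 256 is
assumed as the default numa_balancing_scan_size_mb:

	#include <stdio.h>

	int main(void)
	{
		/* Assumed default of sysctl_numa_balancing_scan_size, in MB */
		unsigned int scan_size = 256;
		/* Assume a 2GB VMA: [vm_start, vm_end) */
		unsigned long vm_start = 0, vm_end = 2UL << 30;

		/* vma size in MB, as in the patch: 2048 */
		unsigned int vma_size = (unsigned int)((vm_end - vm_start) >> 20);
		/* Total scans needed to cover the VMA: 2048 / 256 = 8 */
		unsigned int scan_threshold = vma_size / scan_size;

		/* The patch allows roughly half of that unconditionally: 1 + 8/2 = 5 */
		scan_threshold = 1 + (scan_threshold >> 1);

		for (unsigned int scan_counter = 0; scan_counter <= 6; scan_counter++)
			printf("scan %u: %s\n", scan_counter,
			       scan_counter <= scan_threshold ?
			       "scanned unconditionally" :
			       "scanned only if this task accessed the VMA");
		return 0;
	}

With these assumed numbers, scans 0 through 5 of the 2GB VMA proceed
unconditionally, after which the per-task access filter (the access_pids
test) takes over; the periodic DISJOINT_VMA_SCAN_RENEW_THRESH reset in the
patch re-arms this unconditional window for long-running tasks.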