Message ID | 109ca1ea59b9dd6f2daf7b7fbc74e83ae074fbdf.1693287931.git.raghavendra.kt@amd.com (mailing list archive)
---|---
State | New
Series | sched/numa: Enhance disjoint VMA scanning
Hello,

kernel test robot noticed a -33.6% improvement of autonuma-benchmark.numa02.seconds on:

commit: af46f3c9ca2d16485912f8b9c896ef48bbfe1388 ("[RFC PATCH V1 5/6] sched/numa: Allow recently accessed VMAs to be scanned")
url: https://github.com/intel-lab-lkp/linux/commits/Raghavendra-K-T/sched-numa-Move-up-the-access-pid-reset-logic/20230829-141007
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 2f88c8e802c8b128a155976631f4eb2ce4f3c805
patch link: https://lore.kernel.org/all/109ca1ea59b9dd6f2daf7b7fbc74e83ae074fbdf.1693287931.git.raghavendra.kt@amd.com/
patch subject: [RFC PATCH V1 5/6] sched/numa: Allow recently accessed VMAs to be scanned

testcase: autonuma-benchmark
test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
parameters:

	iterations: 4x
	test: numa01_THREAD_ALLOC
	cpufreq_governor: performance

Details are as below:
-------------------------------------------------------------------------------------------------->

The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230910/202309102311.84b42068-oliver.sang@intel.com

=========================================================================================
compiler/cpufreq_governor/iterations/kconfig/rootfs/tbox_group/test/testcase:
  gcc-12/performance/4x/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp6/numa01_THREAD_ALLOC/autonuma-benchmark

commit:
  167773d1dd ("sched/numa: Increase tasks' access history")
  af46f3c9ca ("sched/numa: Allow recently accessed VMAs to be scanned")

167773d1ddb5ffdd af46f3c9ca2d16485912f8b9c89
---------------- ---------------------------
         %stddev     %change         %stddev
             \          |                \
  2.534e+10 ± 10%     -13.0%   2.204e+10 ±  7%  cpuidle..time
   26431366 ± 10%     -13.2%    22948978 ±  7%  cpuidle..usage
       0.15 ±  4%      -0.0         0.12 ±  3%  mpstat.cpu.all.soft%
       2.92 ±  3%      +0.4         3.32 ±  4%  mpstat.cpu.all.sys%
       2243 ±  2%     -12.7%        1957 ±  3%  uptime.boot
      29811 ±  8%     -11.1%       26507 ±  6%  uptime.idle
       5.32 ± 79%     -64.2%        1.91 ± 60%  perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_exc_page_fault
       2.70 ± 18%     +37.8%        3.72 ±  9%  perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
       0.64 ±137%  +26644.2%      169.91 ±220%  perf-sched.wait_time.avg.ms.__cond_resched.task_work_run.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode
       0.08 ± 20%      +0.0         0.12 ± 10%  perf-profile.children.cycles-pp.terminate_walk
       0.10 ± 25%      +0.0         0.14 ± 10%  perf-profile.children.cycles-pp.wake_up_q
       0.06 ± 50%      +0.0         0.10 ± 10%  perf-profile.children.cycles-pp.vfs_readlink
       0.15 ± 36%      +0.1         0.22 ± 13%  perf-profile.children.cycles-pp.readlink
       1.31 ± 19%      +0.4         1.69 ± 12%  perf-profile.children.cycles-pp.unmap_vmas
       2.46 ± 19%      +0.5         2.99 ±  4%  perf-profile.children.cycles-pp.exit_mmap
     311653 ± 10%     -23.7%      237884 ±  9%  turbostat.C1E
   26018024 ± 10%     -13.1%    22597563 ±  7%  turbostat.C6
       6.41 ±  9%     -13.6%        5.54 ±  8%  turbostat.CPU%c1
       2.47 ± 11%     +36.0%        3.36 ±  6%  turbostat.CPU%c6
  2.881e+08 ±  2%     -12.8%   2.513e+08 ±  3%  turbostat.IRQ
     212.86            +2.8%      218.84        turbostat.RAMWatt
     341.49            -4.1%      327.42 ±  2%  autonuma-benchmark.numa01.seconds
     186.67 ±  6%     -27.1%      136.12 ±  7%  autonuma-benchmark.numa01_THREAD_ALLOC.seconds
      21.17 ±  7%     -33.6%       14.05        autonuma-benchmark.numa02.seconds
       2200 ±  2%     -13.0%        1913 ±  3%  autonuma-benchmark.time.elapsed_time
       2200 ±  2%     -13.0%        1913 ±  3%  autonuma-benchmark.time.elapsed_time.max
    1159380 ±  2%     -12.0%     1019969 ±  3%  autonuma-benchmark.time.involuntary_context_switches
    3363550            -5.0%     3194802        autonuma-benchmark.time.minor_page_faults
     243046 ±  2%     -13.3%      210725 ±  3%  autonuma-benchmark.time.user_time
    7494239            -6.8%     6984234        proc-vmstat.numa_hit
     118829 ±  6%     +13.7%      135136 ±  6%  proc-vmstat.numa_huge_pte_updates
    6207618            -8.4%     5686795 ±  2%  proc-vmstat.numa_local
    8834573 ±  3%     +20.2%    10616944 ±  4%  proc-vmstat.numa_pages_migrated
   61094857 ±  6%     +13.6%    69409875 ±  6%  proc-vmstat.numa_pte_updates
    8602789            -9.0%     7827793 ±  2%  proc-vmstat.pgfault
    8834573 ±  3%     +20.2%    10616944 ±  4%  proc-vmstat.pgmigrate_success
     371818           -10.1%      334391 ±  2%  proc-vmstat.pgreuse
      17200 ±  3%     +20.3%       20686 ±  4%  proc-vmstat.thp_migration_success
   16401792 ±  2%     -12.7%    14322816 ±  3%  proc-vmstat.unevictable_pgs_scanned
  1.606e+08 ±  2%     -13.8%   1.385e+08 ±  3%  sched_debug.cfs_rq:/.avg_vruntime.avg
  1.666e+08 ±  2%     -14.0%   1.433e+08 ±  3%  sched_debug.cfs_rq:/.avg_vruntime.max
  1.364e+08 ±  2%     -11.7%   1.204e+08 ±  3%  sched_debug.cfs_rq:/.avg_vruntime.min
    4795327 ±  7%     -17.5%     3956991 ±  7%  sched_debug.cfs_rq:/.avg_vruntime.stddev
  1.606e+08 ±  2%     -13.8%   1.385e+08 ±  3%  sched_debug.cfs_rq:/.min_vruntime.avg
  1.666e+08 ±  2%     -14.0%   1.433e+08 ±  3%  sched_debug.cfs_rq:/.min_vruntime.max
  1.364e+08 ±  2%     -11.7%   1.204e+08 ±  3%  sched_debug.cfs_rq:/.min_vruntime.min
    4795327 ±  7%     -17.5%     3956991 ±  7%  sched_debug.cfs_rq:/.min_vruntime.stddev
     364.96 ±  6%     +16.6%      425.70 ±  5%  sched_debug.cfs_rq:/.util_est_enqueued.avg
    1099114           -13.0%      956021 ±  2%  sched_debug.cpu.clock.avg
    1099477           -13.0%      956344 ±  2%  sched_debug.cpu.clock.max
    1098702           -13.0%      955643 ±  2%  sched_debug.cpu.clock.min
    1080712           -13.0%      940415 ±  2%  sched_debug.cpu.clock_task.avg
    1085309           -13.1%      943557 ±  2%  sched_debug.cpu.clock_task.max
    1064613           -13.0%      925993 ±  2%  sched_debug.cpu.clock_task.min
      28890 ±  3%     -11.7%       25504 ±  3%  sched_debug.cpu.curr->pid.avg
      35200           -11.0%       31344        sched_debug.cpu.curr->pid.max
     862245 ±  3%      -8.7%      786984        sched_debug.cpu.max_idle_balance_cost.max
      74019 ±  9%     -28.2%       53158 ±  7%  sched_debug.cpu.max_idle_balance_cost.stddev
      15507           -11.9%       13667 ±  2%  sched_debug.cpu.nr_switches.avg
      57616 ±  6%     -19.0%       46642 ±  8%  sched_debug.cpu.nr_switches.max
       8460 ±  6%     -12.9%        7368 ±  5%  sched_debug.cpu.nr_switches.stddev
    1098689           -13.0%      955631 ±  2%  sched_debug.cpu_clk
    1097964           -13.0%      954907 ±  2%  sched_debug.ktime
       0.00           +15.0%        0.00 ±  2%  sched_debug.rt_rq:.rt_nr_migratory.avg
       0.03           +15.0%        0.03 ±  2%  sched_debug.rt_rq:.rt_nr_migratory.max
       0.00           +15.0%        0.00 ±  2%  sched_debug.rt_rq:.rt_nr_migratory.stddev
       0.00           +15.0%        0.00 ±  2%  sched_debug.rt_rq:.rt_nr_running.avg
       0.03           +15.0%        0.03 ±  2%  sched_debug.rt_rq:.rt_nr_running.max
       0.00           +15.0%        0.00 ±  2%  sched_debug.rt_rq:.rt_nr_running.stddev
    1099511           -13.0%      956501 ±  2%  sched_debug.sched_clk
       1162 ±  2%     +15.2%        1339 ±  3%  perf-stat.i.MPKI
  1.656e+08            +3.6%   1.716e+08        perf-stat.i.branch-instructions
       0.95 ±  4%      +0.1         1.03        perf-stat.i.branch-miss-rate%
    1538367 ±  6%     +11.0%     1707146 ±  2%  perf-stat.i.branch-misses
  6.327e+08 ±  3%     +18.7%   7.513e+08 ±  4%  perf-stat.i.cache-misses
  8.282e+08 ±  2%     +15.2%   9.542e+08 ±  3%  perf-stat.i.cache-references
     658.12 ±  3%     -11.4%      582.98 ±  6%  perf-stat.i.cycles-between-cache-misses
  2.201e+08            +2.8%   2.263e+08        perf-stat.i.dTLB-loads
     579771            +0.9%      584915        perf-stat.i.dTLB-store-misses
  1.122e+08            +1.4%   1.138e+08        perf-stat.i.dTLB-stores
  8.278e+08            +3.1%   8.538e+08        perf-stat.i.instructions
      13.98 ±  2%     +14.3%       15.98 ±  3%  perf-stat.i.metric.M/sec
       3797            +4.3%        3958        perf-stat.i.minor-faults
     258749            +8.0%      279391 ±  2%  perf-stat.i.node-load-misses
     261169 ±  2%      +7.4%      280417 ±  5%  perf-stat.i.node-loads
      40.91 ±  3%      -3.0        37.89 ±  3%  perf-stat.i.node-store-miss-rate%
  3.841e+08 ±  6%     +27.6%   4.902e+08 ±  7%  perf-stat.i.node-stores
       3797            +4.3%        3958        perf-stat.i.page-faults
     998.24 ±  2%     +11.8%        1116 ±  2%  perf-stat.overall.MPKI
     463.91            -3.2%      448.99        perf-stat.overall.cpi
     604.23 ±  3%     -15.9%      508.08 ±  4%  perf-stat.overall.cycles-between-cache-misses
       0.00            +3.3%        0.00        perf-stat.overall.ipc
      39.20 ±  5%      -4.5        34.70 ±  6%  perf-stat.overall.node-store-miss-rate%
  1.636e+08            +3.8%   1.698e+08        perf-stat.ps.branch-instructions
    1499760 ±  6%     +11.1%     1665855 ±  2%  perf-stat.ps.branch-misses
  6.296e+08 ±  3%     +19.0%   7.489e+08 ±  4%  perf-stat.ps.cache-misses
  8.178e+08 ±  2%     +15.5%   9.447e+08 ±  3%  perf-stat.ps.cache-references
   2.18e+08            +2.9%   2.244e+08        perf-stat.ps.dTLB-loads
     578148            +0.9%      583328        perf-stat.ps.dTLB-store-misses
  1.117e+08            +1.4%   1.132e+08        perf-stat.ps.dTLB-stores
  8.192e+08            +3.3%    8.46e+08        perf-stat.ps.instructions
       3744            +4.3%        3906        perf-stat.ps.minor-faults
     255974            +8.2%      276924 ±  2%  perf-stat.ps.node-load-misses
     263796 ±  2%      +7.7%      284110 ±  5%  perf-stat.ps.node-loads
   3.82e+08 ±  6%     +27.7%   4.879e+08 ±  7%  perf-stat.ps.node-stores
       3744            +4.3%        3906        perf-stat.ps.page-faults
  1.805e+12 ±  2%     -10.1%   1.622e+12 ±  2%  perf-stat.total.instructions

Disclaimer:
Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.
On 9/10/2023 8:59 PM, kernel test robot wrote:
>     341.49            -4.1%      327.42 ±  2%  autonuma-benchmark.numa01.seconds
>     186.67 ±  6%     -27.1%      136.12 ±  7%  autonuma-benchmark.numa01_THREAD_ALLOC.seconds
>      21.17 ±  7%     -33.6%       14.05        autonuma-benchmark.numa02.seconds
>       2200 ±  2%     -13.0%        1913 ±  3%  autonuma-benchmark.time.elapsed_time

Hello Oliver/Kernel test robot,

Thank you a lot for testing.

The results are impressive. Can I take this result as positive for the
whole series too?

Mel/PeterZ,

Whenever time permits, could you please let us know your comments/concerns
on the series?

Thanks and Regards
- Raghu
hi, Raghu,

On Mon, Sep 11, 2023 at 04:55:56PM +0530, Raghavendra K T wrote:
> On 9/10/2023 8:59 PM, kernel test robot wrote:
> >     341.49            -4.1%      327.42 ±  2%  autonuma-benchmark.numa01.seconds
> >     186.67 ±  6%     -27.1%      136.12 ±  7%  autonuma-benchmark.numa01_THREAD_ALLOC.seconds
> >      21.17 ±  7%     -33.6%       14.05        autonuma-benchmark.numa02.seconds
> >       2200 ±  2%     -13.0%        1913 ±  3%  autonuma-benchmark.time.elapsed_time
>
> Hello Oliver/Kernel test robot,
>
> Thank you a lot for testing.
>
> The results are impressive. Can I take this result as positive for the
> whole series too?

FYI, we applied your patch set like below:

68cfe9439a1ba (linux-review/Raghavendra-K-T/sched-numa-Move-up-the-access-pid-reset-logic/20230829-141007) sched/numa: Allow scanning of shared VMAs
af46f3c9ca2d1 sched/numa: Allow recently accessed VMAs to be scanned
167773d1ddb5f sched/numa: Increase tasks' access history
fc769221b2306 sched/numa: Remove unconditional scan logic using mm numa_scan_seq
1ef5cbb92bdb3 sched/numa: Add disjoint vma unconditional scan logic
2a806eab1c2e1 sched/numa: Move up the access pid reset logic
2f88c8e802c8b (tip/sched/core) sched/eevdf/doc: Modify the documented knob to base_slice_ns as well

In our tests we also tested 68cfe9439a1ba; comparing it to af46f3c9ca2d1:

=========================================================================================
compiler/cpufreq_governor/iterations/kconfig/rootfs/tbox_group/test/testcase:
  gcc-12/performance/4x/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp6/numa01_THREAD_ALLOC/autonuma-benchmark

commit:
  af46f3c9ca ("sched/numa: Allow recently accessed VMAs to be scanned")
  68cfe9439a ("sched/numa: Allow scanning of shared VMA")

af46f3c9ca2d1648 68cfe9439a1baa642e05883fa64
---------------- ---------------------------
         %stddev     %change         %stddev
             \          |                \
     327.42 ±  2%      -1.1%      323.83 ±  3%  autonuma-benchmark.numa01.seconds
     136.12 ±  7%     -25.1%      101.90 ±  2%  autonuma-benchmark.numa01_THREAD_ALLOC.seconds
      14.05            +1.5%       14.26        autonuma-benchmark.numa02.seconds
       1913 ±  3%      -7.9%        1763 ±  2%  autonuma-benchmark.time.elapsed_time

below is the full comparison FYI.
af46f3c9ca2d1648 68cfe9439a1baa642e05883fa64 ---------------- --------------------------- %stddev %change %stddev \ | \ 36437 ± 9% +20.4% 43867 ± 10% meminfo.Mapped 0.02 ± 17% +0.0 0.03 ± 8% mpstat.cpu.all.iowait% 71.00 ± 2% +6.3% 75.50 turbostat.PkgTmp 3956991 ± 7% -15.0% 3361998 ± 5% sched_debug.cfs_rq:/.avg_vruntime.stddev 3956991 ± 7% -15.0% 3361997 ± 5% sched_debug.cfs_rq:/.min_vruntime.stddev -30.18 +27.8% -38.56 sched_debug.cpu.nr_uninterruptible.min 1913 ± 3% -7.9% 1763 ± 2% time.elapsed_time 1913 ± 3% -7.9% 1763 ± 2% time.elapsed_time.max 3194802 -2.4% 3117907 time.minor_page_faults 210725 ± 3% -8.7% 192483 ± 3% time.user_time 327.42 ± 2% -1.1% 323.83 ± 3% autonuma-benchmark.numa01.seconds 136.12 ± 7% -25.1% 101.90 ± 2% autonuma-benchmark.numa01_THREAD_ALLOC.seconds 14.05 +1.5% 14.26 autonuma-benchmark.numa02.seconds 1913 ± 3% -7.9% 1763 ± 2% autonuma-benchmark.time.elapsed_time 1913 ± 3% -7.9% 1763 ± 2% autonuma-benchmark.time.elapsed_time.max 3194802 -2.4% 3117907 autonuma-benchmark.time.minor_page_faults 210725 ± 3% -8.7% 192483 ± 3% autonuma-benchmark.time.user_time 1.33 ± 91% -88.0% 0.16 ± 14% perf-sched.sch_delay.avg.ms.pipe_read.vfs_read.ksys_read.do_syscall_64 0.09 ±194% +3204.2% 3.03 ± 66% perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_reschedule_ipi 3.72 ± 9% -24.8% 2.80 ± 21% perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select 41.00 ±147% +2060.2% 885.67 ±105% perf-sched.wait_and_delay.count.io_schedule.migration_entry_wait_on_locked.__handle_mm_fault.handle_mm_fault 18.61 ± 18% -28.5% 13.30 ± 21% perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone 7.84 ±100% +354.6% 35.66 ± 89% perf-sched.wait_time.max.ms.__cond_resched.__wait_for_common.wait_for_completion_state.kernel_clone.__x64_sys_vfork 9285 ± 8% +20.1% 11152 ± 10% proc-vmstat.nr_mapped 6984234 -4.0% 6706018 proc-vmstat.numa_hit 5686795 ± 2% -5.2% 5390176 proc-vmstat.numa_local 10616944 ± 4% +15.7% 12279801 ± 3% proc-vmstat.numa_pages_migrated 7827793 ± 2% -5.2% 7421440 ± 2% proc-vmstat.pgfault 10616944 ± 4% +15.7% 12279801 ± 3% proc-vmstat.pgmigrate_success 334391 ± 2% -8.6% 305628 ± 2% proc-vmstat.pgreuse 20686 ± 4% +15.7% 23939 ± 3% proc-vmstat.thp_migration_success 14322816 ± 3% -8.2% 13147392 ± 2% proc-vmstat.unevictable_pgs_scanned 1339 ± 3% +8.6% 1454 ± 2% perf-stat.i.MPKI 1.716e+08 +2.8% 1.764e+08 perf-stat.i.branch-instructions 1.03 +0.1 1.11 ± 3% perf-stat.i.branch-miss-rate% 1707146 ± 2% +9.5% 1869960 ± 4% perf-stat.i.branch-misses 7.513e+08 ± 4% +11.1% 8.351e+08 ± 3% perf-stat.i.cache-misses 9.542e+08 ± 3% +8.9% 1.04e+09 ± 3% perf-stat.i.cache-references 534.57 -1.5% 526.34 perf-stat.i.cpi 158.57 +1.6% 161.11 perf-stat.i.cpu-migrations 582.98 ± 6% -11.4% 516.40 ± 3% perf-stat.i.cycles-between-cache-misses 2.263e+08 +2.2% 2.312e+08 perf-stat.i.dTLB-loads 8.538e+08 +2.5% 8.753e+08 perf-stat.i.instructions 15.98 ± 3% +8.9% 17.40 ± 3% perf-stat.i.metric.M/sec 3958 +3.0% 4075 perf-stat.i.minor-faults 37.89 ± 3% -3.6 34.28 ± 5% perf-stat.i.node-store-miss-rate% 2.585e+08 ± 4% -7.7% 2.385e+08 ± 3% perf-stat.i.node-store-misses 4.902e+08 ± 7% +21.1% 5.937e+08 ± 7% perf-stat.i.node-stores 3958 +2.9% 4075 perf-stat.i.page-faults 1116 ± 2% +6.2% 1186 ± 2% perf-stat.overall.MPKI 0.98 +0.1 1.04 ± 3% perf-stat.overall.branch-miss-rate% 448.99 -2.8% 436.60 perf-stat.overall.cpi 508.08 ± 4% -10.1% 456.56 ± 4% perf-stat.overall.cycles-between-cache-misses 
0.00 +2.8% 0.00 perf-stat.overall.ipc 34.70 ± 6% -5.7 29.02 ± 7% perf-stat.overall.node-store-miss-rate% 1.698e+08 +2.8% 1.746e+08 perf-stat.ps.branch-instructions 1665855 ± 2% +9.5% 1824511 ± 3% perf-stat.ps.branch-misses 7.489e+08 ± 4% +10.9% 8.306e+08 ± 4% perf-stat.ps.cache-misses 9.447e+08 ± 3% +8.9% 1.029e+09 ± 3% perf-stat.ps.cache-references 158.05 +1.4% 160.31 perf-stat.ps.cpu-migrations 2.244e+08 +2.1% 2.292e+08 perf-stat.ps.dTLB-loads 8.46e+08 +2.5% 8.672e+08 perf-stat.ps.instructions 3906 +2.9% 4020 perf-stat.ps.minor-faults 284110 ± 5% +12.0% 318166 ± 2% perf-stat.ps.node-loads 2.584e+08 ± 3% -7.3% 2.395e+08 ± 3% perf-stat.ps.node-store-misses 4.879e+08 ± 7% +20.6% 5.883e+08 ± 7% perf-stat.ps.node-stores 3906 +2.9% 4020 perf-stat.ps.page-faults 1.622e+12 ± 2% -5.7% 1.53e+12 ± 2% perf-stat.total.instructions 6.29 ± 13% -2.2 4.11 ± 24% perf-profile.calltrace.cycles-pp.read 6.22 ± 13% -2.2 4.05 ± 24% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.read 6.21 ± 13% -2.2 4.04 ± 24% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.read 6.04 ± 13% -2.1 3.90 ± 24% perf-profile.calltrace.cycles-pp.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read 6.09 ± 13% -2.1 3.96 ± 24% perf-profile.calltrace.cycles-pp.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe.read 3.68 ± 17% -1.4 2.25 ± 36% perf-profile.calltrace.cycles-pp.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe 3.22 ± 16% -1.4 1.79 ± 27% perf-profile.calltrace.cycles-pp.open64 3.66 ± 16% -1.4 2.24 ± 36% perf-profile.calltrace.cycles-pp.path_openat.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64 3.88 ± 13% -1.4 2.49 ± 20% perf-profile.calltrace.cycles-pp.seq_read.vfs_read.ksys_read.do_syscall_64.entry_SYSCALL_64_after_hwframe 3.83 ± 13% -1.4 2.48 ± 19% perf-profile.calltrace.cycles-pp.seq_read_iter.seq_read.vfs_read.ksys_read.do_syscall_64 3.03 ± 17% -1.3 1.71 ± 26% perf-profile.calltrace.cycles-pp.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64 3.09 ± 17% -1.3 1.77 ± 27% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.open64 3.08 ± 17% -1.3 1.76 ± 27% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64 3.04 ± 17% -1.3 1.73 ± 26% perf-profile.calltrace.cycles-pp.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64 2.61 ± 14% -1.0 1.60 ± 20% perf-profile.calltrace.cycles-pp.proc_single_show.seq_read_iter.seq_read.vfs_read.ksys_read 2.58 ± 13% -1.0 1.58 ± 21% perf-profile.calltrace.cycles-pp.do_task_stat.proc_single_show.seq_read_iter.seq_read.vfs_read 0.99 ± 17% -0.5 0.46 ± 75% perf-profile.calltrace.cycles-pp.__xstat64 0.97 ± 18% -0.5 0.46 ± 75% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__xstat64 0.96 ± 18% -0.5 0.46 ± 75% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__xstat64 0.95 ± 18% -0.5 0.45 ± 75% perf-profile.calltrace.cycles-pp.__do_sys_newstat.do_syscall_64.entry_SYSCALL_64_after_hwframe.__xstat64 0.92 ± 19% -0.5 0.45 ± 75% perf-profile.calltrace.cycles-pp.vfs_fstatat.__do_sys_newstat.do_syscall_64.entry_SYSCALL_64_after_hwframe.__xstat64 0.72 ± 12% -0.3 0.40 ± 71% perf-profile.calltrace.cycles-pp.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt 7.12 ± 13% -2.4 4.73 ± 22% perf-profile.children.cycles-pp.ksys_read 6.91 ± 12% -2.3 4.57 ± 23% perf-profile.children.cycles-pp.vfs_read 6.30 ± 13% 
-2.2 4.12 ± 24% perf-profile.children.cycles-pp.read 5.34 ± 12% -1.9 3.46 ± 25% perf-profile.children.cycles-pp.seq_read_iter 4.65 ± 13% -1.7 2.98 ± 31% perf-profile.children.cycles-pp.do_sys_openat2 4.67 ± 13% -1.7 3.01 ± 30% perf-profile.children.cycles-pp.__x64_sys_openat 4.43 ± 13% -1.6 2.86 ± 29% perf-profile.children.cycles-pp.do_filp_open 4.41 ± 13% -1.6 2.85 ± 29% perf-profile.children.cycles-pp.path_openat 3.23 ± 16% -1.4 1.80 ± 27% perf-profile.children.cycles-pp.open64 3.89 ± 13% -1.4 2.49 ± 20% perf-profile.children.cycles-pp.seq_read 2.61 ± 14% -1.0 1.60 ± 20% perf-profile.children.cycles-pp.proc_single_show 2.59 ± 13% -1.0 1.58 ± 21% perf-profile.children.cycles-pp.do_task_stat 1.66 ± 12% -0.7 0.96 ± 36% perf-profile.children.cycles-pp.lookup_fast 1.43 ± 16% -0.6 0.86 ± 29% perf-profile.children.cycles-pp.walk_component 1.50 ± 14% -0.5 0.96 ± 30% perf-profile.children.cycles-pp.link_path_walk 1.24 ± 10% -0.5 0.77 ± 32% perf-profile.children.cycles-pp.do_open 1.53 ± 7% -0.4 1.08 ± 19% perf-profile.children.cycles-pp.sched_setaffinity 1.02 ± 15% -0.4 0.64 ± 33% perf-profile.children.cycles-pp.__xstat64 1.10 ± 18% -0.4 0.72 ± 31% perf-profile.children.cycles-pp.__do_sys_newstat 1.09 ± 18% -0.4 0.73 ± 30% perf-profile.children.cycles-pp.path_lookupat 1.10 ± 18% -0.4 0.74 ± 29% perf-profile.children.cycles-pp.filename_lookup 1.07 ± 19% -0.4 0.72 ± 32% perf-profile.children.cycles-pp.vfs_fstatat 0.97 ± 9% -0.4 0.62 ± 34% perf-profile.children.cycles-pp.do_dentry_open 0.82 ± 19% -0.4 0.48 ± 34% perf-profile.children.cycles-pp.__d_lookup_rcu 0.94 ± 18% -0.3 0.61 ± 35% perf-profile.children.cycles-pp.vfs_statx 0.61 ± 11% -0.3 0.33 ± 32% perf-profile.children.cycles-pp.pid_revalidate 0.78 ± 14% -0.3 0.50 ± 29% perf-profile.children.cycles-pp.tlb_finish_mmu 0.64 ± 15% -0.3 0.37 ± 29% perf-profile.children.cycles-pp.getdents64 0.62 ± 16% -0.3 0.35 ± 28% perf-profile.children.cycles-pp.proc_pid_readdir 0.64 ± 15% -0.3 0.37 ± 29% perf-profile.children.cycles-pp.__x64_sys_getdents64 0.64 ± 15% -0.3 0.37 ± 29% perf-profile.children.cycles-pp.iterate_dir 0.61 ± 15% -0.3 0.35 ± 24% perf-profile.children.cycles-pp.__percpu_counter_init 0.96 ± 8% -0.3 0.71 ± 20% perf-profile.children.cycles-pp.evlist_cpu_iterator__next 1.03 ± 12% -0.2 0.78 ± 15% perf-profile.children.cycles-pp.__libc_read 0.75 ± 8% -0.2 0.53 ± 17% perf-profile.children.cycles-pp.__x64_sys_sched_setaffinity 0.39 ± 13% -0.2 0.19 ± 24% perf-profile.children.cycles-pp.__entry_text_start 0.40 ± 18% -0.2 0.22 ± 25% perf-profile.children.cycles-pp.ptrace_may_access 0.62 ± 7% -0.2 0.45 ± 17% perf-profile.children.cycles-pp.__sched_setaffinity 0.36 ± 16% -0.2 0.20 ± 25% perf-profile.children.cycles-pp.proc_fill_cache 0.57 ± 6% -0.2 0.40 ± 20% perf-profile.children.cycles-pp.__set_cpus_allowed_ptr 0.42 ± 21% -0.2 0.27 ± 38% perf-profile.children.cycles-pp.inode_permission 0.36 ± 20% -0.1 0.22 ± 25% perf-profile.children.cycles-pp._find_next_bit 0.39 ± 14% -0.1 0.25 ± 22% perf-profile.children.cycles-pp.__kmem_cache_alloc_node 0.44 ± 12% -0.1 0.30 ± 26% perf-profile.children.cycles-pp.pick_link 0.25 ± 18% -0.1 0.12 ± 19% perf-profile.children.cycles-pp.security_ptrace_access_check 0.32 ± 15% -0.1 0.19 ± 22% perf-profile.children.cycles-pp.__x64_sys_readlink 0.22 ± 13% -0.1 0.11 ± 33% perf-profile.children.cycles-pp.readlink 0.31 ± 14% -0.1 0.19 ± 22% perf-profile.children.cycles-pp.do_readlinkat 0.32 ± 11% -0.1 0.22 ± 30% perf-profile.children.cycles-pp.vfs_fstat 0.26 ± 19% -0.1 0.15 ± 26% perf-profile.children.cycles-pp.load_elf_interp 
0.22 ± 17% -0.1 0.12 ± 32% perf-profile.children.cycles-pp.d_hash_and_lookup 0.21 ± 31% -0.1 0.12 ± 31% perf-profile.children.cycles-pp.may_open 0.30 ± 14% -0.1 0.21 ± 18% perf-profile.children.cycles-pp.copy_strings 0.24 ± 18% -0.1 0.14 ± 32% perf-profile.children.cycles-pp.unlink_anon_vmas 0.19 ± 19% -0.1 0.10 ± 32% perf-profile.children.cycles-pp.__kmalloc_node 0.29 ± 8% -0.1 0.21 ± 10% perf-profile.children.cycles-pp.affine_move_task 0.24 ± 21% -0.1 0.16 ± 24% perf-profile.children.cycles-pp.__mod_memcg_lruvec_state 0.22 ± 10% -0.1 0.14 ± 28% perf-profile.children.cycles-pp.mas_preallocate 0.24 ± 12% -0.1 0.16 ± 30% perf-profile.children.cycles-pp.mas_alloc_nodes 0.21 ± 14% -0.1 0.14 ± 20% perf-profile.children.cycles-pp.__d_alloc 0.10 ± 19% -0.1 0.03 ±100% perf-profile.children.cycles-pp.pid_task 0.14 ± 24% -0.1 0.06 ± 50% perf-profile.children.cycles-pp.single_open 0.20 ± 11% -0.1 0.12 ± 12% perf-profile.children.cycles-pp.cpu_stop_queue_work 0.18 ± 16% -0.1 0.11 ± 25% perf-profile.children.cycles-pp.generic_fillattr 0.14 ± 19% -0.1 0.07 ± 29% perf-profile.children.cycles-pp.apparmor_ptrace_access_check 0.14 ± 23% -0.1 0.08 ± 30% perf-profile.children.cycles-pp.native_flush_tlb_one_user 0.10 ± 10% -0.1 0.04 ± 71% perf-profile.children.cycles-pp.vfs_readlink 0.09 ± 19% -0.1 0.03 ±100% perf-profile.children.cycles-pp.aa_get_task_label 0.14 ± 25% -0.1 0.08 ± 23% perf-profile.children.cycles-pp.proc_pid_get_link 0.16 ± 21% -0.1 0.10 ± 28% perf-profile.children.cycles-pp.thread_group_cputime_adjusted 0.19 ± 15% -0.1 0.13 ± 27% perf-profile.children.cycles-pp.strnlen_user 0.18 ± 27% -0.1 0.11 ± 21% perf-profile.children.cycles-pp.wq_worker_comm 0.18 ± 13% -0.1 0.11 ± 36% perf-profile.children.cycles-pp.vfs_getattr_nosec 0.17 ± 16% -0.1 0.11 ± 24% perf-profile.children.cycles-pp.proc_pid_cmdline_read 0.12 ± 10% -0.1 0.06 ± 48% perf-profile.children.cycles-pp.terminate_walk 0.14 ± 18% -0.1 0.09 ± 27% perf-profile.children.cycles-pp.thread_group_cputime 0.13 ± 21% -0.0 0.08 ± 27% perf-profile.children.cycles-pp.get_obj_cgroup_from_current 0.14 ± 18% -0.0 0.10 ± 26% perf-profile.children.cycles-pp.get_mm_cmdline 0.14 ± 10% -0.0 0.10 ± 17% perf-profile.children.cycles-pp.wake_up_q 1.37 ± 16% -0.6 0.81 ± 23% perf-profile.self.cycles-pp.do_task_stat 0.80 ± 18% -0.3 0.46 ± 34% perf-profile.self.cycles-pp.__d_lookup_rcu 0.39 ± 15% -0.2 0.19 ± 33% perf-profile.self.cycles-pp.pid_revalidate 0.37 ± 11% -0.2 0.18 ± 22% perf-profile.self.cycles-pp.__entry_text_start 0.36 ± 14% -0.2 0.21 ± 37% perf-profile.self.cycles-pp.do_dentry_open 0.44 ± 17% -0.1 0.31 ± 24% perf-profile.self.cycles-pp.gather_pte_stats 0.23 ± 15% -0.1 0.14 ± 14% perf-profile.self.cycles-pp.__kmem_cache_alloc_node 0.10 ± 18% -0.1 0.03 ±100% perf-profile.self.cycles-pp.pid_task 0.21 ± 17% -0.1 0.14 ± 25% perf-profile.self.cycles-pp.__mod_memcg_lruvec_state 0.14 ± 23% -0.1 0.08 ± 30% perf-profile.self.cycles-pp.native_flush_tlb_one_user 0.16 ± 23% -0.1 0.09 ± 26% perf-profile.self.cycles-pp.generic_fillattr 0.09 ± 20% -0.1 0.03 ±101% perf-profile.self.cycles-pp.unlink_anon_vmas 0.10 ± 25% -0.1 0.04 ± 76% perf-profile.self.cycles-pp.proc_fill_cache 0.12 ± 20% -0.1 0.06 ± 58% perf-profile.self.cycles-pp.lookup_fast > > Mel/PeterZ, > > Whenever time permits can you please let us know your comments/concerns > on the series? > > Thanks and Regards > - Raghu >
On 9/12/2023 7:52 AM, Oliver Sang wrote:
> hi, Raghu,
>
> On Mon, Sep 11, 2023 at 04:55:56PM +0530, Raghavendra K T wrote:
>> On 9/10/2023 8:59 PM, kernel test robot wrote:
>>>     341.49            -4.1%      327.42 ±  2%  autonuma-benchmark.numa01.seconds
>>>     186.67 ±  6%     -27.1%      136.12 ±  7%  autonuma-benchmark.numa01_THREAD_ALLOC.seconds
>>>      21.17 ±  7%     -33.6%       14.05        autonuma-benchmark.numa02.seconds
>>>       2200 ±  2%     -13.0%        1913 ±  3%  autonuma-benchmark.time.elapsed_time
>>
>> Hello Oliver/Kernel test robot,
>>
>> Thank you a lot for testing.
>>
>> The results are impressive. Can I take this result as positive for the
>> whole series too?
>
> FYI, we applied your patch set like below:
>
> 68cfe9439a1ba (linux-review/Raghavendra-K-T/sched-numa-Move-up-the-access-pid-reset-logic/20230829-141007) sched/numa: Allow scanning of shared VMAs
> af46f3c9ca2d1 sched/numa: Allow recently accessed VMAs to be scanned
> 167773d1ddb5f sched/numa: Increase tasks' access history
> fc769221b2306 sched/numa: Remove unconditional scan logic using mm numa_scan_seq
> 1ef5cbb92bdb3 sched/numa: Add disjoint vma unconditional scan logic
> 2a806eab1c2e1 sched/numa: Move up the access pid reset logic
> 2f88c8e802c8b (tip/sched/core) sched/eevdf/doc: Modify the documented knob to base_slice_ns as well
>
> In our tests we also tested 68cfe9439a1ba; comparing it to af46f3c9ca2d1:
>
> =========================================================================================
> compiler/cpufreq_governor/iterations/kconfig/rootfs/tbox_group/test/testcase:
>   gcc-12/performance/4x/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp6/numa01_THREAD_ALLOC/autonuma-benchmark
>
> commit:
>   af46f3c9ca ("sched/numa: Allow recently accessed VMAs to be scanned")
>   68cfe9439a ("sched/numa: Allow scanning of shared VMA")
>
> af46f3c9ca2d1648 68cfe9439a1baa642e05883fa64
> ---------------- ---------------------------
>          %stddev     %change         %stddev
>              \          |                \
>     327.42 ±  2%      -1.1%      323.83 ±  3%  autonuma-benchmark.numa01.seconds
>     136.12 ±  7%     -25.1%      101.90 ±  2%  autonuma-benchmark.numa01_THREAD_ALLOC.seconds
>      14.05            +1.5%       14.26        autonuma-benchmark.numa02.seconds
>       1913 ±  3%      -7.9%        1763 ±  2%  autonuma-benchmark.time.elapsed_time
>
> below is the full comparison FYI.
>

Thanks a lot for the further run and the details.

Combining this result with the previous one, we have a very good overall
result from LKP:

167773d1dd ("sched/numa: Increase tasks' access history")
af46f3c9ca ("sched/numa: Allow recently accessed VMAs to be scanned")

167773d1ddb5ffdd af46f3c9ca2d16485912f8b9c89
---------------- ---------------------------
         %stddev     %change         %stddev
     341.49            -4.1%      327.42 ±  2%  autonuma-benchmark.numa01.seconds
     186.67 ±  6%     -27.1%      136.12 ±  7%  autonuma-benchmark.numa01_THREAD_ALLOC.seconds
      21.17 ±  7%     -33.6%       14.05        autonuma-benchmark.numa02.seconds
       2200 ±  2%     -13.0%        1913 ±  3%  autonuma-benchmark.time.elapsed_time

Thanks and Regards
- Raghu

>
>
>>
>> Mel/PeterZ,
>>
>> Whenever time permits, could you please let us know your comments/concerns
>> on the series?
>>
>> Thanks and Regards
>> - Raghu
>>
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ae2a1a3ef5c..6529da7f370a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2971,8 +2971,22 @@ static inline bool vma_test_access_pid_history(struct vm_area_struct *vma)
 	return test_bit(pid_bit, &pids);
 }
 
+static inline bool vma_accessed_recent(struct vm_area_struct *vma)
+{
+	unsigned long *pids, pid_idx;
+
+	pid_idx = vma->numab_state->access_pid_idx;
+	pids = vma->numab_state->access_pids + pid_idx;
+
+	return (bitmap_weight(pids, BITS_PER_LONG) >= 1);
+}
+
 static bool vma_is_accessed(struct vm_area_struct *vma)
 {
+	/* Check at least one task had accessed VMA recently. */
+	if (vma_accessed_recent(vma))
+		return true;
+
 	/* Check if the current task had historically accessed VMA. */
 	if (vma_test_access_pid_history(vma))
 		return true;
This ensures hot VMAs get scanned with priority, irrespective of whether the
current task has accessed them.

Suggested-by: Bharata B Rao <bharata@amd.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 kernel/sched/fair.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)
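For readers who want to experiment with the idea outside the kernel tree, here is a minimal
user-space sketch of the check the patch adds. Only the access_pids[]/access_pid_idx fields
and the "any bit set in the current window means recently accessed" test are taken from the
diff above; the window size, the vma_sim structure and the record/rotate helpers are
illustrative stand-ins for the kernel's real per-VMA bookkeeping, not its actual API.

	/* sketch.c - illustrative model of the recent-access check */
	#include <stdbool.h>
	#include <stdio.h>

	#define NR_ACCESS_WINDOWS	2	/* current + previous scan window (assumed) */
	#define PIDS_PER_WORD		64	/* bits per tracking word */

	struct vma_sim {
		unsigned long access_pids[NR_ACCESS_WINDOWS];	/* PID-hash bitmaps */
		unsigned int access_pid_idx;			/* index of current window */
	};

	/* Fold a pid onto one bit of the current window; collisions between
	 * pids are tolerated, as with any small hashed filter. */
	static void vma_record_access(struct vma_sim *vma, int pid)
	{
		vma->access_pids[vma->access_pid_idx] |= 1UL << (pid % PIDS_PER_WORD);
	}

	/* Rotate to a fresh window, forgetting the oldest accesses. */
	static void vma_new_scan_window(struct vma_sim *vma)
	{
		vma->access_pid_idx = (vma->access_pid_idx + 1) % NR_ACCESS_WINDOWS;
		vma->access_pids[vma->access_pid_idx] = 0;
	}

	/* User-space stand-in for bitmap_weight(): any bit set in the current
	 * window means some task touched the VMA recently. */
	static bool vma_accessed_recent(struct vma_sim *vma)
	{
		return __builtin_popcountl(vma->access_pids[vma->access_pid_idx]) >= 1;
	}

	int main(void)
	{
		struct vma_sim vma = { 0 };

		printf("fresh vma recent?  %d\n", vma_accessed_recent(&vma));	/* 0 */
		vma_record_access(&vma, 1234);
		printf("after an access:   %d\n", vma_accessed_recent(&vma));	/* 1 */
		vma_new_scan_window(&vma);
		printf("after new window:  %d\n", vma_accessed_recent(&vma));	/* 0 */
		return 0;
	}

The ">= 1" test is deliberately coarse: it does not care which task set the bit, only that
some task did so within the current window. That is what lets vma_is_accessed() prioritise
hot VMAs for scanning even when the scanning task itself never touched them.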