Message ID: 20231004183623.17067-3-tony.luck@intel.com (mailing list archive)
State: New, archived
Series: Handle corrected machine check interrupt storms
Hello,

kernel test robot noticed a -8.8% regression of stress-ng.clock.ops_per_sec on:

commit: 26bff7b04b829cccc6a97726d6398391a62e34ef ("[PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation")
url: https://github.com/intel-lab-lkp/linux/commits/Tony-Luck/x86-mce-Remove-old-CMCI-storm-mitigation-code/20231005-024047
patch link: https://lore.kernel.org/all/20231004183623.17067-3-tony.luck@intel.com/
patch subject: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation

testcase: stress-ng
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:

	nr_threads: 10%
	disk: 1HDD
	testtime: 60s
	fs: ext4
	class: os
	test: clock
	cpufreq_governor: performance

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202310111637.dee70328-oliver.sang@intel.com

Details are as below:
-------------------------------------------------------------------------------------------------->

The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231011/202310111637.dee70328-oliver.sang@intel.com

=========================================================================================
class/compiler/cpufreq_governor/disk/fs/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  os/gcc-12/performance/1HDD/ext4/x86_64-rhel-8.3/10%/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp7/clock/stress-ng/60s

commit:
  cbb2ef4fc1 ("x86/mce: Remove old CMCI storm mitigation code")
  26bff7b04b ("x86/mce: Add per-bank CMCI storm mitigation")

cbb2ef4fc1f946b3 26bff7b04b829cccc6a97726d63
---------------- ---------------------------
         %stddev     %change         %stddev
             \          |                \
    118898 ± 7%  -18.5%   96883 ± 8%  meminfo.Mapped
      2971 ± 7%  +18.3%    3513 ± 9%  perf-c2c.HITM.local
      0.01       -20.0%    0.00       perf-sched.wait_time.avg.ms.do_nanosleep.hrtimer_nanosleep.common_nsleep_timens.__x64_sys_clock_nanosleep
   3865971 ± 4%  -23.2% 2970833 ± 2%  turbostat.C1
      1.48 ± 6%    -0.3    1.22 ± 4%  turbostat.C1%
    292422        -9.4%  264956       vmstat.system.cs
    219398        -7.4%  203113       vmstat.system.in
   4715828        -8.8% 4301560       stress-ng.clock.ops
     78597        -8.8%   71692       stress-ng.clock.ops_per_sec
    305.30        +1.7%  310.35       stress-ng.time.system_time
   9425269        -8.8% 8596889       stress-ng.time.voluntary_context_switches
    157229 ± 2%   -5.1%  149224       proc-vmstat.nr_active_anon
    904759        -1.7%  889679       proc-vmstat.nr_file_pages
    123722 ± 2%   -6.1%  116162 ± 2%  proc-vmstat.nr_inactive_anon
     29778 ± 7%  -18.5%   24260 ± 8%  proc-vmstat.nr_mapped
    194663        -7.8%  179535       proc-vmstat.nr_shmem
    157229 ± 2%   -5.1%  149224       proc-vmstat.nr_zone_active_anon
    123722 ± 2%   -6.1%  116162 ± 2%  proc-vmstat.nr_zone_inactive_anon
    844968        -5.5%  798662       proc-vmstat.numa_hit
    778755        -5.9%  732457 ± 2%  proc-vmstat.numa_local
    259547        -5.7%  244756       proc-vmstat.pgactivate
   1060836        -7.1%  986009 ± 2%  proc-vmstat.pgalloc_normal
    755630 ± 2%   -7.0%  702996 ± 4%  proc-vmstat.pgfree
      2.70       -17.4%    2.23       perf-stat.i.MPKI
 1.428e+09        -5.8% 1.345e+09     perf-stat.i.branch-instructions
      0.99         +0.1    1.09 ± 2%  perf-stat.i.branch-miss-rate%
     35.43         -7.4   28.07       perf-stat.i.cache-miss-rate%
  19762268       -22.2% 15365818      perf-stat.i.cache-misses
  55484296        -1.5% 54670651      perf-stat.i.cache-references
    305445        -8.7%  278849       perf-stat.i.context-switches
      2.88        +6.9%    3.08       perf-stat.i.cpi
    400.01 ± 3%   -6.5%  373.89 ± 3%  perf-stat.i.cpu-migrations
      1090       +27.2%    1387       perf-stat.i.cycles-between-cache-misses
 1.989e+09        -6.6% 1.857e+09     perf-stat.i.dTLB-loads
 1.153e+09        -7.6% 1.066e+09     perf-stat.i.dTLB-stores
 7.703e+09        -6.3% 7.221e+09     perf-stat.i.instructions
      0.38        -5.8%    0.36       perf-stat.i.ipc
      1124        -6.3%    1053       perf-stat.i.metric.K/sec
     71.46        -6.6%   66.71       perf-stat.i.metric.M/sec
   3530238 ± 3%  +48.1% 5229217 ± 3%  perf-stat.i.node-load-misses
     90.35         -5.3   85.09       perf-stat.i.node-store-miss-rate%
  14871425 ± 2%  -40.5% 8846683       perf-stat.i.node-store-misses
   1393999 ± 2%  +10.0% 1533309       perf-stat.i.node-stores
      2.57       -17.0%    2.13       perf-stat.overall.MPKI
      1.03 ± 2%    +0.1    1.14 ± 3%  perf-stat.overall.branch-miss-rate%
     35.62         -7.5   28.11       perf-stat.overall.cache-miss-rate%
      2.73        +6.4%    2.90       perf-stat.overall.cpi
      1063       +28.2%    1363       perf-stat.overall.cycles-between-cache-misses
      0.37        -6.0%    0.34       perf-stat.overall.ipc
     91.42         -6.2   85.23       perf-stat.overall.node-store-miss-rate%
 1.405e+09        -5.8% 1.323e+09     perf-stat.ps.branch-instructions
  19444141       -22.2% 15118353      perf-stat.ps.cache-misses
  54591369        -1.5% 53789356      perf-stat.ps.cache-references
    300511        -8.7%  274335       perf-stat.ps.context-switches
    393.53 ± 3%   -6.5%  367.80 ± 3%  perf-stat.ps.cpu-migrations
 1.957e+09        -6.6% 1.827e+09     perf-stat.ps.dTLB-loads
 1.135e+09        -7.6% 1.049e+09     perf-stat.ps.dTLB-stores
  7.58e+09        -6.3% 7.105e+09     perf-stat.ps.instructions
   3473672 ± 3%  +48.1% 5144934 ± 3%  perf-stat.ps.node-load-misses
  14630992 ± 2%  -40.5% 8703485       perf-stat.ps.node-store-misses
   1371912 ± 2%  +10.0% 1508626       perf-stat.ps.node-stores
 4.721e+11        -5.7%  4.45e+11     perf-stat.total.instructions
     13.27         -1.8   11.46 ± 2%  perf-profile.calltrace.cycles-pp.ktime_get_update_offsets_now.clock_was_set.timekeeping_inject_offset.do_adjtimex.__do_sys_clock_adjtime
      7.00 ± 3%    -1.3    5.74 ± 2%  perf-profile.calltrace.cycles-pp.syscall
      5.50 ± 2%    -1.2    4.26 ± 6%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.syscall
      5.25 ± 2%    -1.2    4.02 ± 6%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
      4.27 ± 2%    -1.1    3.13 ± 9%  perf-profile.calltrace.cycles-pp.__x64_sys_clock_gettime.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
     14.93         -1.1   13.82 ± 2%  perf-profile.calltrace.cycles-pp.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
     17.13         -1.1   16.06 ± 2%  perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
     14.04         -1.0   13.00 ± 2%  perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry
     14.22         -1.0   13.18 ± 2%  perf-profile.calltrace.cycles-pp.cpuidle_enter.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
     16.74         -1.0   15.77 ± 2%  perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
     16.78         -1.0   15.80 ± 2%  perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64_no_verify
     16.77         -1.0   15.80 ± 2%  perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
      3.29 ± 7%    -0.8    2.51 ± 13%  perf-profile.calltrace.cycles-pp.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
      2.50 ± 7%    -0.6    1.94 ± 6%  perf-profile.calltrace.cycles-pp.intel_idle_irq.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
      0.68 ± 3%    -0.4    0.26 ±100%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.timekeeping_inject_offset.do_adjtimex.__do_sys_clock_adjtime
      2.09         -0.4    1.67 ± 2%  perf-profile.calltrace.cycles-pp.timekeeping_update.timekeeping_inject_offset.do_adjtimex.__do_sys_clock_adjtime.do_syscall_64
      4.87         -0.4    4.49 ± 4%  perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call.do_idle
      4.70         -0.4    4.31 ± 4%  perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
      1.99 ± 7%    -0.4    1.62 ± 6%  perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.intel_idle_irq.cpuidle_enter_state.cpuidle_enter.cpuidle_idle_call
      1.78 ± 7%    -0.3    1.44 ± 6%  perf-profile.calltrace.cycles-pp.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.intel_idle_irq.cpuidle_enter_state.cpuidle_enter
      1.40 ± 8%    -0.3    1.08 ± 6%  perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.intel_idle_irq.cpuidle_enter_state
      1.40 ± 8%    -0.3    1.08 ± 6%  perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.intel_idle_irq
      0.56 ± 4%    -0.3    0.26 ±100%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__close
      3.07 ± 3%    -0.3    2.80 ± 5%  perf-profile.calltrace.cycles-pp.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state
      3.08 ± 3%    -0.3    2.81 ± 5%  perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter
      1.74 ± 6%    -0.2    1.51 ± 6%  perf-profile.calltrace.cycles-pp.timer_settime
      2.12 ± 2%    -0.2    1.90 ± 5%  perf-profile.calltrace.cycles-pp.add_device_randomness.do_adjtimex.__do_sys_clock_adjtime.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.95         -0.2    0.73 ± 4%  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.timekeeping_inject_offset.do_adjtimex.__do_sys_clock_adjtime.do_syscall_64
      1.58 ± 6%    -0.2    1.36 ± 7%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.timer_settime
      0.71 ± 7%    -0.2    0.50 ± 45%  perf-profile.calltrace.cycles-pp.posix_get_monotonic_raw.__x64_sys_clock_gettime.do_syscall_64.entry_SYSCALL_64_after_hwframe.syscall
      1.60 ± 6%    -0.2    1.39 ± 6%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.timer_settime
      1.55 ± 7%    -0.2    1.33 ± 6%  perf-profile.calltrace.cycles-pp.__x64_sys_timer_settime.do_syscall_64.entry_SYSCALL_64_after_hwframe.timer_settime
      0.70 ± 7%    -0.2    0.49 ± 46%  perf-profile.calltrace.cycles-pp.ktime_get_raw_ts64.posix_get_monotonic_raw.__x64_sys_clock_gettime.do_syscall_64.entry_SYSCALL_64_after_hwframe
      1.34 ± 8%    -0.2    1.14 ± 7%  perf-profile.calltrace.cycles-pp.do_timer_settime.__x64_sys_timer_settime.do_syscall_64.entry_SYSCALL_64_after_hwframe.timer_settime
      2.84         -0.2    2.64 ± 4%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.clock_nanosleep
      2.79 ± 2%    -0.2    2.61 ± 4%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
      2.30         -0.2    2.12       perf-profile.calltrace.cycles-pp.open64
      2.24         -0.2    2.07       perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
      2.25         -0.2    2.08       perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.open64
      2.23         -0.2    2.06       perf-profile.calltrace.cycles-pp.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
      2.22         -0.2    2.06       perf-profile.calltrace.cycles-pp.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
      1.36 ± 4%    -0.2    1.20 ± 18%  perf-profile.calltrace.cycles-pp.__entry_text_start.syscall
      1.79 ± 4%    -0.2    1.63 ± 6%  perf-profile.calltrace.cycles-pp.tick_sched_timer.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt
      2.70 ± 2%    -0.2    2.54 ± 4%  perf-profile.calltrace.cycles-pp.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
      1.12 ± 3%    -0.1    0.97 ± 6%  perf-profile.calltrace.cycles-pp.blake2s_update.add_device_randomness.do_adjtimex.__do_sys_clock_adjtime.do_syscall_64
      0.60 ± 4%    -0.1    0.46 ± 44%  perf-profile.calltrace.cycles-pp.__close
      2.02 ± 2%    -0.1    1.88       perf-profile.calltrace.cycles-pp.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64.entry_SYSCALL_64_after_hwframe
      2.00 ± 2%    -0.1    1.86       perf-profile.calltrace.cycles-pp.path_openat.do_filp_open.do_sys_openat2.__x64_sys_openat.do_syscall_64
      0.95 ± 3%    -0.1    0.82 ± 6%  perf-profile.calltrace.cycles-pp.blake2s_compress.blake2s_update.add_device_randomness.do_adjtimex.__do_sys_clock_adjtime
      1.26 ± 2%    -0.1    1.13 ± 3%  perf-profile.calltrace.cycles-pp.do_open.path_openat.do_filp_open.do_sys_openat2.__x64_sys_openat
      1.07 ± 3%    -0.1    0.98 ± 5%  perf-profile.calltrace.cycles-pp.hrtimer_nanosleep.common_nsleep_timens.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe
      1.08 ± 3%    -0.1    0.99 ± 5%  perf-profile.calltrace.cycles-pp.common_nsleep_timens.__x64_sys_clock_nanosleep.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_nanosleep
      0.86 ± 2%    -0.1    0.78 ± 6%  perf-profile.calltrace.cycles-pp.hrtimer_start_range_ns.do_nanosleep.hrtimer_nanosleep.common_nsleep.__x64_sys_clock_nanosleep
      1.01 ± 3%    -0.1    0.93 ± 6%  perf-profile.calltrace.cycles-pp.do_nanosleep.hrtimer_nanosleep.common_nsleep_timens.__x64_sys_clock_nanosleep.do_syscall_64
      0.96         -0.1    0.88 ± 2%  perf-profile.calltrace.cycles-pp.read_tsc.ktime_get_update_offsets_now.clock_was_set.timekeeping_inject_offset.do_adjtimex
      0.69 ± 4%    -0.1    0.62 ± 3%  perf-profile.calltrace.cycles-pp.chrdev_open.do_dentry_open.do_open.path_openat.do_filp_open
      0.98 ± 4%    -0.1    0.91       perf-profile.calltrace.cycles-pp.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.poll_idle.cpuidle_enter_state
      0.86 ± 4%    -0.1    0.80 ± 3%  perf-profile.calltrace.cycles-pp.do_dentry_open.do_open.path_openat.do_filp_open.do_sys_openat2
      0.96 ± 3%    +0.1    1.09 ± 5%  perf-profile.calltrace.cycles-pp.tick_nohz_idle_enter.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
      0.87 ± 6%    +0.1    1.00 ± 5%  perf-profile.calltrace.cycles-pp.posix_timer_unhash_and_free.__x64_sys_timer_delete.do_syscall_64.entry_SYSCALL_64_after_hwframe.timer_delete
      6.08 ± 2%    +0.4    6.44 ± 2%  perf-profile.calltrace.cycles-pp.igb_ptp_gettimex_82580.pc_clock_gettime.__x64_sys_clock_gettime.do_syscall_64.entry_SYSCALL_64_after_hwframe
      6.27 ± 2%    +0.4    6.66 ± 2%  perf-profile.calltrace.cycles-pp.pc_clock_gettime.__x64_sys_clock_gettime.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_gettime
      6.39 ± 2%    +0.4    6.78 ± 2%  perf-profile.calltrace.cycles-pp.clock_gettime
      6.32 ± 2%    +0.4    6.72 ± 2%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.clock_gettime
      6.32 ± 2%    +0.4    6.71 ± 2%  perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_gettime
      6.30 ± 2%    +0.4    6.70 ± 2%  perf-profile.calltrace.cycles-pp.__x64_sys_clock_gettime.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_gettime
      0.08 ±223%   +0.5    0.55 ± 5%  perf-profile.calltrace.cycles-pp.asm_sysvec_apic_timer_interrupt.tick_nohz_idle_enter.do_idle.cpu_startup_entry.start_secondary
      2.58 ± 7%    +0.5    3.13 ± 2%  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.igb_ptp_gettimex_82580.pc_clock_gettime.__x64_sys_clock_gettime.do_syscall_64
      2.30 ± 7%    +0.6    2.89 ± 2%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.igb_ptp_gettimex_82580.pc_clock_gettime.__x64_sys_clock_gettime
     53.43         +2.9   56.32       perf-profile.calltrace.cycles-pp.clock_adjtime
     53.05         +2.9   55.98       perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.clock_adjtime
     52.98         +2.9   55.92       perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_adjtime
     52.81         +2.9   55.75       perf-profile.calltrace.cycles-pp.__do_sys_clock_adjtime.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_adjtime
     51.37         +3.1   54.49       perf-profile.calltrace.cycles-pp.do_adjtimex.__do_sys_clock_adjtime.do_syscall_64.entry_SYSCALL_64_after_hwframe.clock_adjtime
     47.33         +3.4   50.75       perf-profile.calltrace.cycles-pp.timekeeping_inject_offset.do_adjtimex.__do_sys_clock_adjtime.do_syscall_64.entry_SYSCALL_64_after_hwframe
     43.22         +4.3   47.50       perf-profile.calltrace.cycles-pp.clock_was_set.timekeeping_inject_offset.do_adjtimex.__do_sys_clock_adjtime.do_syscall_64
      4.17 ± 13%   +4.7    8.85 ± 4%  perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.clock_was_set.timekeeping_inject_offset.do_adjtimex
     21.83         +5.8   27.67       perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.clock_was_set.timekeeping_inject_offset.do_adjtimex.__do_sys_clock_adjtime
     14.07         -1.8   12.26 ± 2%  perf-profile.children.cycles-pp.ktime_get_update_offsets_now
      7.34         -1.4    5.98 ± 4%  perf-profile.children.cycles-pp.syscall
     15.24         -1.2   14.06 ± 2%  perf-profile.children.cycles-pp.cpuidle_idle_call
     14.48         -1.1   13.37 ± 2%  perf-profile.children.cycles-pp.cpuidle_enter_state
     14.51         -1.1   13.40 ± 2%  perf-profile.children.cycles-pp.cpuidle_enter
     17.11         -1.1   16.04 ± 2%  perf-profile.children.cycles-pp.do_idle
     17.13         -1.1   16.06 ± 2%  perf-profile.children.cycles-pp.secondary_startup_64_no_verify
     17.13         -1.1   16.06 ± 2%  perf-profile.children.cycles-pp.cpu_startup_entry
     16.78         -1.0   15.80 ± 2%  perf-profile.children.cycles-pp.start_secondary
     10.61 ± 2%    -0.7    9.86 ± 3%  perf-profile.children.cycles-pp.__x64_sys_clock_gettime
      9.62         -0.7    8.90 ± 3%  perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
      8.93         -0.7    8.26 ± 3%  perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
      2.76 ± 6%    -0.6    2.14 ± 6%  perf-profile.children.cycles-pp.intel_idle_irq
      6.40         -0.6    5.81 ± 3%  perf-profile.children.cycles-pp.hrtimer_interrupt
      6.41         -0.6    5.83 ± 3%  perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
      4.11 ± 2%    -0.5    3.60 ± 4%  perf-profile.children.cycles-pp.__hrtimer_run_queues
      1.09 ± 4%    -0.5    0.62 ± 8%  perf-profile.children.cycles-pp.ktime_get_real_ts64
      0.88 ± 6%    -0.4    0.45 ± 14%  perf-profile.children.cycles-pp.posix_get_realtime_timespec
      2.09         -0.4    1.67 ± 2%  perf-profile.children.cycles-pp.timekeeping_update
      2.12 ± 2%    -0.2    1.90 ± 5%  perf-profile.children.cycles-pp.add_device_randomness
      1.78 ± 6%    -0.2    1.56 ± 5%  perf-profile.children.cycles-pp.timer_settime
      1.55 ± 7%    -0.2    1.34 ± 6%  perf-profile.children.cycles-pp.__x64_sys_timer_settime
      1.35 ± 8%    -0.2    1.14 ± 7%  perf-profile.children.cycles-pp.do_timer_settime
      1.19 ± 3%    -0.2    0.99 ± 5%  perf-profile.children.cycles-pp.try_to_wake_up
      1.19 ± 3%    -0.2    0.99 ± 4%  perf-profile.children.cycles-pp.hrtimer_wakeup
      0.77 ± 4%    -0.2    0.58 ± 8%  perf-profile.children.cycles-pp.ktime_get_with_offset
      0.52 ± 7%    -0.2    0.33 ± 17%  perf-profile.children.cycles-pp.posix_get_monotonic_timespec
      0.51 ± 6%    -0.2    0.32 ± 19%  perf-profile.children.cycles-pp.ktime_get_ts64
      2.25         -0.2    2.08       perf-profile.children.cycles-pp.do_sys_openat2
      2.31         -0.2    2.14       perf-profile.children.cycles-pp.open64
      2.26         -0.2    2.08       perf-profile.children.cycles-pp.__x64_sys_openat
      2.10 ± 3%    -0.2    1.92 ± 5%  perf-profile.children.cycles-pp.tick_sched_timer
      2.71 ± 2%    -0.2    2.54 ± 4%  perf-profile.children.cycles-pp.__x64_sys_clock_nanosleep
      2.55         -0.2    2.39 ± 4%  perf-profile.children.cycles-pp.hrtimer_nanosleep
      2.45 ± 2%    -0.2    2.30 ± 5%  perf-profile.children.cycles-pp.do_nanosleep
      1.12 ± 3%    -0.1    0.98 ± 6%  perf-profile.children.cycles-pp.blake2s_update
      0.36 ± 6%    -0.1    0.21 ± 17%  perf-profile.children.cycles-pp.smp_call_function_many_cond
      0.39 ± 3%    -0.1    0.24 ± 4%  perf-profile.children.cycles-pp.update_vsyscall
      0.72 ± 7%    -0.1    0.57 ± 13%  perf-profile.children.cycles-pp.posix_get_monotonic_raw
      0.53         -0.1    0.38 ± 8%  perf-profile.children.cycles-pp.posix_get_tai_timespec
      0.70 ± 7%    -0.1    0.56 ± 14%  perf-profile.children.cycles-pp.ktime_get_raw_ts64
      2.03 ± 2%    -0.1    1.89       perf-profile.children.cycles-pp.path_openat
      2.04 ± 2%    -0.1    1.91       perf-profile.children.cycles-pp.do_filp_open
      0.95 ± 3%    -0.1    0.82 ± 6%  perf-profile.children.cycles-pp.blake2s_compress
      1.50 ± 2%    -0.1    1.37 ± 2%  perf-profile.children.cycles-pp.read_tsc
      1.26 ± 3%    -0.1    1.14 ± 3%  perf-profile.children.cycles-pp.do_open
      1.38 ± 2%    -0.1    1.26 ± 6%  perf-profile.children.cycles-pp.hrtimer_start_range_ns
      0.54 ± 4%    -0.1    0.43 ± 3%  perf-profile.children.cycles-pp.ttwu_do_activate
      0.26 ± 8%    -0.1    0.16 ± 21%  perf-profile.children.cycles-pp.ktime_get_coarse_real_ts64
      0.25 ± 6%    -0.1    0.14 ± 21%  perf-profile.children.cycles-pp.posix_get_realtime_coarse
      1.47 ± 2%    -0.1    1.37 ± 3%  perf-profile.children.cycles-pp.__entry_text_start
      1.09 ± 3%    -0.1    0.99 ± 5%  perf-profile.children.cycles-pp.common_nsleep_timens
      0.56 ± 3%    -0.1    0.46 ± 7%  perf-profile.children.cycles-pp.update_fast_timekeeper
      1.53 ± 3%    -0.1    1.44 ± 3%  perf-profile.children.cycles-pp.tick_sched_handle
      1.14 ± 2%    -0.1    1.05 ± 2%  perf-profile.children.cycles-pp._copy_to_user
      0.51 ± 2%    -0.1    0.43 ± 4%  perf-profile.children.cycles-pp.timekeeping_forward_now
      0.16 ± 7%    -0.1    0.08 ± 18%  perf-profile.children.cycles-pp.__flush_smp_call_function_queue
      0.32 ± 3%    -0.1    0.25 ± 7%  perf-profile.children.cycles-pp.enqueue_entity
      0.69 ± 4%    -0.1    0.62 ± 4%  perf-profile.children.cycles-pp.chrdev_open
      0.36 ± 3%    -0.1    0.29 ± 7%  perf-profile.children.cycles-pp.activate_task
      0.15 ± 9%    -0.1    0.08 ± 23%  perf-profile.children.cycles-pp.asm_sysvec_call_function_single
      0.34 ± 4%    -0.1    0.28 ± 7%  perf-profile.children.cycles-pp.enqueue_task_fair
      0.30 ± 7%    -0.1    0.24 ± 6%  perf-profile.children.cycles-pp.capable
      0.54 ± 4%    -0.1    0.48 ± 3%  perf-profile.children.cycles-pp.syscall_exit_to_user_mode
      0.40 ± 7%    -0.1    0.34 ± 11%  perf-profile.children.cycles-pp.timerfd_clock_was_set
      0.62 ± 4%    -0.1    0.56 ± 4%  perf-profile.children.cycles-pp.__close
      0.57 ± 4%    -0.1    0.51 ± 6%  perf-profile.children.cycles-pp.menu_select
      0.54 ± 3%    -0.1    0.48 ± 4%  perf-profile.children.cycles-pp.__x64_sys_close
      0.31 ± 9%    -0.1    0.25 ± 5%  perf-profile.children.cycles-pp.lockref_get_not_dead
      0.28 ± 5%    -0.1    0.22 ± 6%  perf-profile.children.cycles-pp.apparmor_capable
      0.14 ± 16%   -0.1    0.08 ± 27%  perf-profile.children.cycles-pp.ktime_mono_to_any
      0.13 ± 8%    -0.1    0.07 ± 20%  perf-profile.children.cycles-pp.sysvec_call_function_single
      0.32 ± 8%    -0.1    0.26 ± 5%  perf-profile.children.cycles-pp.__legitimize_path
      0.30 ± 5%    -0.1    0.24 ± 5%  perf-profile.children.cycles-pp.security_capable
      0.08 ± 17%   -0.1    0.02 ± 99%  perf-profile.children.cycles-pp.security_file_free
      0.52 ± 5%    -0.1    0.47 ± 4%  perf-profile.children.cycles-pp.alloc_empty_file
      0.11 ± 6%    -0.1    0.06 ± 21%  perf-profile.children.cycles-pp.__sysvec_call_function_single
      0.45 ± 4%    -0.1    0.40 ± 6%  perf-profile.children.cycles-pp.__fput
      0.40 ± 5%    -0.1    0.35 ± 4%  perf-profile.children.cycles-pp.init_file
      0.31 ± 5%    -0.1    0.26 ± 6%  perf-profile.children.cycles-pp.notifier_call_chain
      0.69 ± 3%    -0.0    0.64 ± 2%  perf-profile.children.cycles-pp.kmem_cache_alloc
      0.18 ± 9%    -0.0    0.14 ± 8%  perf-profile.children.cycles-pp.kobject_get
      0.18 ± 9%    -0.0    0.13 ± 14%  perf-profile.children.cycles-pp.check_preempt_curr
      0.23 ± 6%    -0.0    0.19 ± 8%  perf-profile.children.cycles-pp._raw_spin_lock_irq
      0.15 ± 11%   -0.0    0.11 ± 14%  perf-profile.children.cycles-pp.resched_curr
      0.29 ± 5%    -0.0    0.25 ± 5%  perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
      0.60 ± 3%    -0.0    0.56 ± 2%  perf-profile.children.cycles-pp.lapic_next_deadline
      0.22 ± 5%    -0.0    0.19 ± 8%  perf-profile.children.cycles-pp.security_file_alloc
      0.18 ± 6%    -0.0    0.15 ± 12%  perf-profile.children.cycles-pp.smpboot_thread_fn
      0.11 ± 8%    -0.0    0.08 ± 5%  perf-profile.children.cycles-pp.run_ksoftirqd
      0.26 ± 5%    -0.0    0.23 ± 4%  perf-profile.children.cycles-pp.syscall_return_via_sysret
      0.14 ± 9%    -0.0    0.11 ± 9%  perf-profile.children.cycles-pp.getname_flags
      0.24 ± 3%    -0.0    0.22 ± 4%  perf-profile.children.cycles-pp.exit_to_user_mode_prepare
      0.10 ± 6%    +0.0    0.12 ± 6%  perf-profile.children.cycles-pp.posix_clock_poll
      0.12 ± 6%    +0.0    0.14 ± 4%  perf-profile.children.cycles-pp.do_poll
      0.27 ± 5%    +0.0    0.29 ± 3%  perf-profile.children.cycles-pp.down_read
      0.15 ± 9%    +0.0    0.18 ± 7%  perf-profile.children.cycles-pp.up_read
      0.12 ± 6%    +0.0    0.15 ± 6%  perf-profile.children.cycles-pp.link_path_walk
      0.18 ± 5%    +0.0    0.22 ± 7%  perf-profile.children.cycles-pp.__switch_to
      0.21 ± 4%    +0.1    0.26 ± 13%  perf-profile.children.cycles-pp.finish_task_switch
      0.00         +0.1    0.08 ± 21%  perf-profile.children.cycles-pp.step_into
      0.87 ± 6%    +0.1    1.00 ± 5%  perf-profile.children.cycles-pp.posix_timer_unhash_and_free
      6.08 ± 2%    +0.4    6.45 ± 2%  perf-profile.children.cycles-pp.igb_ptp_gettimex_82580
      6.27 ± 2%    +0.4    6.66 ± 2%  perf-profile.children.cycles-pp.pc_clock_gettime
      6.39 ± 2%    +0.4    6.78 ± 2%  perf-profile.children.cycles-pp.clock_gettime
     77.76         +1.4   79.16       perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
     77.38         +1.4   78.81       perf-profile.children.cycles-pp.do_syscall_64
     53.58         +2.9   56.46       perf-profile.children.cycles-pp.clock_adjtime
     52.83         +2.9   55.78       perf-profile.children.cycles-pp.__do_sys_clock_adjtime
     51.38         +3.1   54.50       perf-profile.children.cycles-pp.do_adjtimex
     47.34         +3.4   50.76       perf-profile.children.cycles-pp.timekeeping_inject_offset
     43.40         +4.3   47.66       perf-profile.children.cycles-pp.clock_was_set
      9.11 ± 8%    +5.2   14.27 ± 2%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
     29.22         +6.0   35.19       perf-profile.children.cycles-pp._raw_spin_lock_irqsave
     13.04         -1.7   11.30 ± 2%  perf-profile.self.cycles-pp.ktime_get_update_offsets_now
      1.03 ± 5%    -0.5    0.57 ± 8%  perf-profile.self.cycles-pp.ktime_get_real_ts64
      4.24 ± 4%    -0.3    3.92 ± 5%  perf-profile.self.cycles-pp.ktime_get
      0.81 ± 4%    -0.2    0.58 ± 7%  perf-profile.self.cycles-pp.intel_idle_irq
      0.70 ± 5%    -0.2    0.51 ± 9%  perf-profile.self.cycles-pp.ktime_get_with_offset
      0.47 ± 7%    -0.2    0.29 ± 21%  perf-profile.self.cycles-pp.ktime_get_ts64
      0.39 ± 3%    -0.1    0.24 ± 4%  perf-profile.self.cycles-pp.update_vsyscall
      0.66 ± 8%    -0.1    0.52 ± 15%  perf-profile.self.cycles-pp.ktime_get_raw_ts64
      0.74 ± 3%    -0.1    0.61 ± 4%  perf-profile.self.cycles-pp.__do_sys_clock_adjtime
      0.31 ± 7%    -0.1    0.18 ± 16%  perf-profile.self.cycles-pp.smp_call_function_many_cond
      0.95 ± 3%    -0.1    0.82 ± 5%  perf-profile.self.cycles-pp.blake2s_compress
      0.80 ± 2%    -0.1    0.68 ± 3%  perf-profile.self.cycles-pp.timekeeping_update
      1.43 ± 2%    -0.1    1.32 ± 2%  perf-profile.self.cycles-pp.read_tsc
      0.26 ± 7%    -0.1    0.15 ± 22%  perf-profile.self.cycles-pp.ktime_get_coarse_real_ts64
      0.98 ± 4%    -0.1    0.87 ± 5%  perf-profile.self.cycles-pp.poll_idle
      0.56 ± 3%    -0.1    0.46 ± 7%  perf-profile.self.cycles-pp.update_fast_timekeeper
      0.32 ± 5%    -0.1    0.23 ± 5%  perf-profile.self.cycles-pp.timekeeping_inject_offset
      1.12 ± 2%    -0.1    1.03 ± 2%  perf-profile.self.cycles-pp._copy_to_user
      0.50 ± 2%    -0.1    0.41 ± 5%  perf-profile.self.cycles-pp.timekeeping_forward_now
      0.51 ± 7%    -0.1    0.45 ± 3%  perf-profile.self.cycles-pp.hrtimer_interrupt
      0.30 ± 8%    -0.1    0.24 ± 3%  perf-profile.self.cycles-pp.lockref_get_not_dead
      0.10 ± 9%    -0.1    0.04 ± 45%  perf-profile.self.cycles-pp.enqueue_entity
      0.13 ± 18%   -0.1    0.08 ± 25%  perf-profile.self.cycles-pp.ktime_mono_to_any
      0.27 ± 6%    -0.1    0.21 ± 5%  perf-profile.self.cycles-pp.apparmor_capable
      0.16 ± 8%    -0.1    0.10 ± 21%  perf-profile.self.cycles-pp.hrtimer_start_range_ns
      0.18 ± 9%    -0.0    0.14 ± 8%  perf-profile.self.cycles-pp.kobject_get
      0.15 ± 11%   -0.0    0.11 ± 14%  perf-profile.self.cycles-pp.resched_curr
      0.59 ± 2%    -0.0    0.56 ± 2%  perf-profile.self.cycles-pp.lapic_next_deadline
      0.41 ± 4%    -0.0    0.38 ± 3%  perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.33 ± 3%    -0.0    0.30 ± 5%  perf-profile.self.cycles-pp.do_syscall_64
      0.20 ± 7%    -0.0    0.17 ± 5%  perf-profile.self.cycles-pp.syscall_exit_to_user_mode
      0.26 ± 5%    -0.0    0.23 ± 4%  perf-profile.self.cycles-pp.syscall_return_via_sysret
      0.19 ± 3%    -0.0    0.16 ± 8%  perf-profile.self.cycles-pp.apparmor_file_alloc_security
      0.10 ± 4%    -0.0    0.08 ± 8%  perf-profile.self.cycles-pp.update_curr
      0.14 ± 11%   +0.0    0.18 ± 7%  perf-profile.self.cycles-pp.up_read
      0.17 ± 5%    +0.0    0.21 ± 6%  perf-profile.self.cycles-pp.__switch_to
      0.01 ±223%   +0.1    0.08 ± 8%  perf-profile.self.cycles-pp.dequeue_task_fair
     20.56         +0.9   21.48       perf-profile.self.cycles-pp._raw_spin_lock_irqsave
      9.08 ± 8%    +5.2   14.24 ± 2%  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath

Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
> kernel test robot noticed a -8.8% regression of stress-ng.clock.ops_per_sec on:
>
>
> commit: 26bff7b04b829cccc6a97726d6398391a62e34ef ("[PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation")
> url: https://github.com/intel-lab-lkp/linux/commits/Tony-Luck/x86-mce-Remove-old-CMCI-storm-mitigation-code/20231005-024047
> patch link: https://lore.kernel.org/all/20231004183623.17067-3-tony.luck@intel.com/
> patch subject: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation
>
> testcase: stress-ng
> test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
> parameters:
>
> 	nr_threads: 10%
> 	disk: 1HDD
> 	testtime: 60s
> 	fs: ext4
> 	class: os
> 	test: clock
> 	cpufreq_governor: performance
>
>
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202310111637.dee70328-oliver.sang@intel.com

Is the test injecting massive numbers of corrected memory errors? The code
in this patch is only executed when handling CMCI interrupts, or polling
machine check banks (at most once per second).

I'm guessing this report is just because alignment of some hot path code
changed.

-Tony
On Wed, Oct 11, 2023 at 11:16:46PM +0800, Luck, Tony wrote:
> > kernel test robot noticed a -8.8% regression of stress-ng.clock.ops_per_sec on:
> >
> >
> > commit: 26bff7b04b829cccc6a97726d6398391a62e34ef ("[PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation")
> > url: https://github.com/intel-lab-lkp/linux/commits/Tony-Luck/x86-mce-Remove-old-CMCI-storm-mitigation-code/20231005-024047
> > patch link: https://lore.kernel.org/all/20231004183623.17067-3-tony.luck@intel.com/
> > patch subject: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation
> >
> > testcase: stress-ng
> > test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
> > parameters:
> >
> > 	nr_threads: 10%
> > 	disk: 1HDD
> > 	testtime: 60s
> > 	fs: ext4
> > 	class: os
> > 	test: clock
> > 	cpufreq_governor: performance
> >
> >
> >
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <oliver.sang@intel.com>
> > | Closes: https://lore.kernel.org/oe-lkp/202310111637.dee70328-oliver.sang@intel.com
>
> Is the test injecting massive numbers of corrected memory errors? The code in this patch
> is only executed when handling CMCI interrupts, or polling machine check banks (at most
> once per second).
>
> I'm guessing this report is just because alignment of some hot path code changed.

IIRC, CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B was enabled in 0Day's
kernel config for quite a while, to force each function's start
address to be aligned on 64 bytes. Don't know if this has been changed
recently.

Also I noticed the patch introduces a new per-cpu variable "storm_desc";
if the function address is 64B aligned, then per-cpu data alignment
may be related.

Thanks,
Feng

>
> -Tony
> IIRC, CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B was enabled in 0Day's
> kernel config for quite a while, to force each function's start
> address to be aligned on 64 bytes. Don't know if this has been changed
> recently.
>
> Also I noticed the patch introduces a new per-cpu variable "storm_desc";
> if the function address is 64B aligned, then per-cpu data alignment
> may be related.

If adding (removing, changing) per-cpu variables can cause 8% performance
changes, then maybe there needs to be some way to insulate the builds
from these side effects (as was done with the 64-byte function alignment).
I've no ideas on how to achieve that :-(

"storm_desc" is a largish per-cpu variable at 1568 bytes. But that doesn't
quite get it into the top 20.

In the config used for my kernel build, it does appear just before a bunch
of interrupt related per-cpu variables. Those might be sensitive to alignment?

000000000001ff40 d smca_misc_banks_map
000000000001ff48 d bank_map
000000000001ff50 d threshold_banks
000000000001ff60 d storm_desc
0000000000020580 d tsc_adjust
00000000000205c0 d lapic_events
00000000000206c0 d vector_cleanup
0000000000020720 D __uv_cpu_info
0000000000020738 d ipi_mask

-Tony
On Wed, Oct 11, 2023 at 11:42:26PM +0800, Feng Tang wrote:
> On Wed, Oct 11, 2023 at 11:16:46PM +0800, Luck, Tony wrote:
> > > kernel test robot noticed a -8.8% regression of stress-ng.clock.ops_per_sec on:
> > >
> > >
> > > commit: 26bff7b04b829cccc6a97726d6398391a62e34ef ("[PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation")
> > > url: https://github.com/intel-lab-lkp/linux/commits/Tony-Luck/x86-mce-Remove-old-CMCI-storm-mitigation-code/20231005-024047
> > > patch link: https://lore.kernel.org/all/20231004183623.17067-3-tony.luck@intel.com/
> > > patch subject: [PATCH v9 2/3] x86/mce: Add per-bank CMCI storm mitigation
> > >
> > > testcase: stress-ng
> > > test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
> > > parameters:
> > >
> > > 	nr_threads: 10%
> > > 	disk: 1HDD
> > > 	testtime: 60s
> > > 	fs: ext4
> > > 	class: os
> > > 	test: clock
> > > 	cpufreq_governor: performance
> > >
> > >
> > >
> > >
> > > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > > the same patch/commit), kindly add following tags
> > > | Reported-by: kernel test robot <oliver.sang@intel.com>
> > > | Closes: https://lore.kernel.org/oe-lkp/202310111637.dee70328-oliver.sang@intel.com
> >
> > Is the test injecting massive numbers of corrected memory errors? The code in this patch
> > is only executed when handling CMCI interrupts, or polling machine check banks (at most
> > once per second).
> >
> > I'm guessing this report is just because alignment of some hot path code changed.
>
> IIRC, CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B was enabled in 0Day's
> kernel config for quite a while, to force each function's start
> address to be aligned on 64 bytes. Don't know if this has been changed
> recently.

yes, the kernel is always built with -falign-functions=64 to test
performance now.

>
> Also I noticed the patch introduces a new per-cpu variable "storm_desc";
> if the function address is 64B aligned, then per-cpu data alignment
> may be related.
>
> Thanks,
> Feng
>
> >
> > -Tony
On Thu, Oct 12, 2023 at 01:23:54AM +0800, Luck, Tony wrote:
> > IIRC, CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B was enabled in 0Day's
> > kernel config for quite a while, to force each funtion's start
> > address aligned on 64 bytes. Don't know if this has been changed
> > recently.
> >
> > Also I noticed the patch introduce a new per-cpu variable 'storm_desc",
> > if the function address is 64B aligned, then per-cpu data alignment
> > may be related.
>
> If adding (removing, changing) per-cpu variables can cause 8% performance
> changes, then maybe there needs to be some way to insulate the builds
> from these side effects (as was done with the 64-byte function alignment).
> I've no ideas on how to achieve that :-(

Philip helped to clarify that the 0Day kernel build does enable 64-byte
function address alignment.

> "storm_desc" is a largish per-cpu variable at 1568 bytes. But that doesn't
> quite get it into the top 20.
>
> In the config used for my kernel build, it does appear just before a bunch
> of interrupt related per-cpu variables. Those might be sensitive to alignment?
>
> 000000000001ff40 d smca_misc_banks_map
> 000000000001ff48 d bank_map
> 000000000001ff50 d threshold_banks
> 000000000001ff60 d storm_desc
> 0000000000020580 d tsc_adjust
> 00000000000205c0 d lapic_events
> 00000000000206c0 d vector_cleanup
> 0000000000020720 D __uv_cpu_info
> 0000000000020738 d ipi_mask

I did a further check by putting 'storm_desc' into another section with
the patch below, and the regression is gone, which indicates it is
related to the per-cpu data alignment change.
diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
index ecdf13f1bb7d5..7f32126f216bc 100644
--- a/arch/x86/kernel/cpu/mce/threshold.c
+++ b/arch/x86/kernel/cpu/mce/threshold.c
@@ -41,7 +41,7 @@ struct mca_storm_desc {
 	bool poll_mode;
 };
 
-static DEFINE_PER_CPU(struct mca_storm_desc, storm_desc);
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct mca_storm_desc, storm_desc);

Thanks,
Feng

>
> -Tony
On Thu, Oct 12, 2023 at 01:36:41PM +0800, Feng Tang wrote:
> On Thu, Oct 12, 2023 at 01:23:54AM +0800, Luck, Tony wrote:
> > > IIRC, CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B was enabled in 0Day's
> > > kernel config for quite a while, to force each funtion's start
> > > address aligned on 64 bytes. Don't know if this has been changed
> > > recently.
> > >
> > > Also I noticed the patch introduce a new per-cpu variable 'storm_desc",
> > > if the function address is 64B aligned, then per-cpu data alignment
> > > may be related.
> >
> > If adding (removing, changing) per-cpu variables can cause 8% performance
> > changes, then maybe there needs to be some way to insulate the builds
> > from these side effects (as was done with the 64-byte function alignment).
> > I've no ideas on how to achieve that :-(

As for mitigation (how to reduce this kind of strange performance
change), I proposed "selective isolation" at LPC 2021, on page 15 of
https://lpc.events/event/11/contributions/895/attachments/770/1603/Strange_kernel_performance_changes_lpc_2021.pdf

As the kernel binary is compactly linked together, an alignment change
of text/data in earlier modules can affect the modules linked after
them. The MCE code is especially prone to these strange performance
changes, just because it sits in the arch/ folder and gets linked early
in the kernel binary.

The idea of "selective isolation" is simple: sparsely pick some modules
and enforce some alignment on their functions/data/per-cpu data, so that
they act as fences/borders, separating the kernel into multiple
capsules and making alignment changes only take effect inside the
specific capsule they occur in.

Any thoughts? thanks!

- Feng

> Philip helped to clarify that 0Day kernel build does enable the 64 bytes
> function address alignment.

[...]
On Wed, Oct 04, 2023 at 11:36:22AM -0700, Tony Luck wrote:
> +/*
> + * history: bitmask tracking whether errors were seen or not seen in
> + * the most recent polls of a bank.

each bit in that bitmask represents an error seen.

> + * timestamp: last time (in jiffies) that the bank was polled
> + * storm: Is this bank in storm mode?
> + */
> +struct storm_bank {
> +	u64 history;
> +	u64 timestamp;
> +	bool storm;

I guess "in_storm_mode" is even more descriptive:

	storm->banks[bank].in_storm_mode = false;

etc.

> +};
> +
> +/* How many errors within the history buffer mark the start of a storm. */
> +#define STORM_BEGIN_THRESHOLD	5
> +
> +/*
> + * How many polls of machine check bank without an error before declaring
> + * the storm is over. Since it is tracked by the bitmaks in the history
> + * field of struct storm_bank the mask is 30 bits [0 ... 29].
> + */
> +#define STORM_END_POLL_THRESHOLD	29
>
> #ifdef CONFIG_ACPI_APEI
> int apei_write_mce(struct mce *m);
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index f6e87443b37a..7c931f0c9251 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -680,6 +680,8 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
> 		barrier();
> 		m.status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS));
>
> +		mce_track_storm(&m);

Lemme see if I understand the idea here: the default polling interval is
5 mins. So the storm tracking is called once every 5 mins to see how the
banks are "doing", so to speak, and to get them in or out of storm mode.
So far so good...

> +
> 		/* If this entry is not valid, ignore it */
> 		if (!(m.status & MCI_STATUS_VAL))
> 			continue;

Btw, you're tracking the storm even if the error is not valid - see the
conditional above. Why?
> @@ -1652,22 +1654,29 @@ static void mce_timer_fn(struct timer_list *t) > else > iv = min(iv * 2, round_jiffies_relative(check_interval * HZ)); > > - __this_cpu_write(mce_next_interval, iv); > - __start_timer(t, iv); > + if (mce_is_storm()) { > + __start_timer(t, HZ); > + } else { > + __this_cpu_write(mce_next_interval, iv); > + __start_timer(t, iv); > + } ... this is where it becomes, hm, interesting: the check interval will be halved if an error has been seen during this round but then if we're in storm mode, that check interval doesn't matter - you'll run the timer each second. Then you need to restructure this to check the storm condition and not do anything to iv if storm. Or, am I missing something? > diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c > index ef4e7bb5fd88..ecdf13f1bb7d 100644 > --- a/arch/x86/kernel/cpu/mce/threshold.c > +++ b/arch/x86/kernel/cpu/mce/threshold.c > @@ -29,3 +29,115 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_threshold) > trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR); > apic_eoi(); > } > + > +/* > + * banks: per-cpu, per-bank details > + * stormy_bank_count: count of MC banks in storm state > + * poll_mode: CPU is in poll mode > + */ > +struct mca_storm_desc { > + struct storm_bank banks[MAX_NR_BANKS]; > + u8 stormy_bank_count; > + bool poll_mode; > +}; Yeah, put the struct definition into internal.h pls. > +static DEFINE_PER_CPU(struct mca_storm_desc, storm_desc); > + > +void mce_inherit_storm(unsigned int bank) > +{ > + struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc); > + > + storm->banks[bank].history = ~0ull; So upon inheriting a bank, you set its history that it has seen errors each time? That's weird. > + storm->banks[bank].timestamp = jiffies; > +} > + > +bool mce_is_storm(void) That's a weird name. mce_get_storm_mode() perhaps? 
> +{ > + return __this_cpu_read(storm_desc.poll_mode); > +} > + > +void mce_set_storm(bool storm) mce_set_storm_mode() > +{ > + __this_cpu_write(storm_desc.poll_mode, storm); > +} > + > +static void mce_handle_storm(unsigned int bank, bool on) > +{ > + switch (boot_cpu_data.x86_vendor) { > + } > +} ... > +void mce_track_storm(struct mce *mce) > +{ > + struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc); > + unsigned long now = jiffies, delta; > + unsigned int shift = 1; > + u64 history = 0; > + > + /* > + * When a bank is in storm mode it is polled once per second and > + * the history mask will record about the last minute of poll results. > + * If it is not in storm mode, then the bank is only checked when > + * there is a CMCI interrupt. Check how long it has been since > + * this bank was last checked, and adjust the amount of "shift" > + * to apply to history. > + */ > + if (!storm->banks[mce->bank].storm) { Yeah, if this were if (!storm->banks[mce->bank].in_storm_mode) it would've been perfectly clear what the condition tests. > + delta = now - storm->banks[mce->bank].timestamp; > + shift = (delta + HZ) / HZ; > + } > + > + /* If it has been a long time since the last poll, clear history. */ > + if (shift < 64) Use a properly named define instead of a naked number. ... Thx.
On Thu, Oct 19, 2023 at 05:12:11PM +0200, Borislav Petkov wrote: > On Wed, Oct 04, 2023 at 11:36:22AM -0700, Tony Luck wrote: > > +/* > > + * history: bitmask tracking whether errors were seen or not seen in > > + * the most recent polls of a bank. > > each bit in that bitmask represents an error seen. Yes. That reads better. > > > + * timestamp: last time (in jiffies) that the bank was polled > > + * storm: Is this bank in storm mode? > > + */ > > +struct storm_bank { > > + u64 history; > > + u64 timestamp; > > + bool storm; > > I guess "in_storm_mode" is even more descriptive: > > storm->banks[bank].in_storm_mode = false; > > etc. Will fix. > > > +}; > > + > > +/* How many errors within the history buffer mark the start of a storm. */ > > +#define STORM_BEGIN_THRESHOLD 5 > > + > > +/* > > + * How many polls of machine check bank without an error before declaring > > + * the storm is over. Since it is tracked by the bitmaks in the history > > + * field of struct storm_bank the mask is 30 bits [0 ... 29]. > > + */ > > +#define STORM_END_POLL_THRESHOLD 29 > > > > #ifdef CONFIG_ACPI_APEI > > int apei_write_mce(struct mce *m); > > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c > > index f6e87443b37a..7c931f0c9251 100644 > > --- a/arch/x86/kernel/cpu/mce/core.c > > +++ b/arch/x86/kernel/cpu/mce/core.c > > @@ -680,6 +680,8 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b) > > barrier(); > > m.status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS)); > > > > + mce_track_storm(&m); > > Lemme see if I understand the idea here: > > the default polling interval is 5 mins. So the storm tracking is called > every 5 mins once to see how are banks "doing", so to speak and to get > them in or out of storm mode. So far so good... > > > + > > /* If this entry is not valid, ignore it */ > > if (!(m.status & MCI_STATUS_VAL)) > > continue; > > Btw, you're tracking storm even if the error is not valid - conditional > above here. Why? 
I want to track whether each bank is in storm mode, or not. But there's
no indication when a CMCI is delivered which bank was the source. Code
just has to scan all the banks, and might find more than one with an
error. While no bank is in polling mode, there isn't a set time interval
between scanning banks. A scan is just triggered when a CMCI happens.
So it's non-trivial to compute a rate. It would require saving a
timestamp for every logged error.

In a simple case there's just one bank responsible for a ton of CMCI.
No need for complexity here, the count of interrupts from that bank will
hit a threshold and a storm is declared.

But more complex scenarios are possible. Other banks may trigger small
numbers of CMCI. Not enough to call it a storm. Or multiple banks may
be screaming together.

By tracking both the hits and misses in each bank, I end up with a
bitmap history for the past 64 polls. If there are enough "1" bits in
that bitmap to meet the threshold, then declare a storm for that bank.

>
> > @@ -1652,22 +1654,29 @@ static void mce_timer_fn(struct timer_list *t)
> > 	else
> > 		iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));
> >
> > -	__this_cpu_write(mce_next_interval, iv);
> > -	__start_timer(t, iv);
> > +	if (mce_is_storm()) {
> > +		__start_timer(t, HZ);
> > +	} else {
> > +		__this_cpu_write(mce_next_interval, iv);
> > +		__start_timer(t, iv);
> > +	}
>
> ... this is where it becomes, hm, interesting: the check interval will
> be halved if an error has been seen during this round but then if we're
> in storm mode, that check interval doesn't matter - you'll run the timer
> each second.
>
> Then you need to restructure this to check the storm condition and not
> do anything to iv if storm.
>
> Or, am I missing something?

I need to stare at this again to refresh my memory of what's going on
here. This code may need pulling apart into a routine that is used for
systems with no CMCI (or have CMCI disabled).
Then the whole "divide the poll interval by two" when you see an error
and double the interval when you don't see an error makes sense.

For systems with CMCI ... I think just polling at a one second interval
until the storm is over makes sense.

> > diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
> > index ef4e7bb5fd88..ecdf13f1bb7d 100644
> > --- a/arch/x86/kernel/cpu/mce/threshold.c
> > +++ b/arch/x86/kernel/cpu/mce/threshold.c
> > @@ -29,3 +29,115 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_threshold)
> > 	trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR);
> > 	apic_eoi();
> > }
> > +
> > +/*
> > + * banks: per-cpu, per-bank details
> > + * stormy_bank_count: count of MC banks in storm state
> > + * poll_mode: CPU is in poll mode
> > + */
> > +struct mca_storm_desc {
> > +	struct storm_bank banks[MAX_NR_BANKS];
> > +	u8 stormy_bank_count;
> > +	bool poll_mode;
> > +};
>
> Yeah, put the struct definition into internal.h pls.

These are only used in threshold.c now. What's the point of them being
in internal.h? That's for definitions shared by multiple mce/*.c files,
isn't it? But will move there if you still want this.

> > +static DEFINE_PER_CPU(struct mca_storm_desc, storm_desc);
> > +
> > +void mce_inherit_storm(unsigned int bank)
> > +{
> > +	struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
> > +
> > +	storm->banks[bank].history = ~0ull;
>
> So upon inheriting a bank, you set its history that it has seen errors
> each time?
>
> That's weird.

Ideally the new CPU would inherit the precise state of the previous
owner of this bank. But there's no easy way to track that as the bank
is abandoned by the CPU going offline, and there is a free-for-all with
remaining CPUs racing to claim ownership. It is known that this bank
was in storm mode (because the threshold in the CTL2 bank register is
set to CMCI_STORM_THRESHOLD).

I went with "worst case" to make sure the new CPU didn't prematurely
declare an end to the storm.
I'll add a comment in mce_inherit_storm() to explain this. > > > + storm->banks[bank].timestamp = jiffies; > > +} > > + > > +bool mce_is_storm(void) > > That's a weird name. mce_get_storm_mode() perhaps? Sure. > > > +{ > > + return __this_cpu_read(storm_desc.poll_mode); > > +} > > + > > +void mce_set_storm(bool storm) > > mce_set_storm_mode() Also sure. > > > +{ > > + __this_cpu_write(storm_desc.poll_mode, storm); > > +} > > + > > +static void mce_handle_storm(unsigned int bank, bool on) > > +{ > > + switch (boot_cpu_data.x86_vendor) { > > + } > > +} > > ... > > > +void mce_track_storm(struct mce *mce) > > +{ > > + struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc); > > + unsigned long now = jiffies, delta; > > + unsigned int shift = 1; > > + u64 history = 0; > > + > > + /* > > + * When a bank is in storm mode it is polled once per second and > > + * the history mask will record about the last minute of poll results. > > + * If it is not in storm mode, then the bank is only checked when > > + * there is a CMCI interrupt. Check how long it has been since > > + * this bank was last checked, and adjust the amount of "shift" > > + * to apply to history. > > + */ > > + if (!storm->banks[mce->bank].storm) { > > Yeah, if this were > > if (!storm->banks[mce->bank].in_storm_mode) > > it would've been perfectly clear what the condition tests. Yup. > > > + delta = now - storm->banks[mce->bank].timestamp; > > + shift = (delta + HZ) / HZ; > > + } > > + > > + /* If it has been a long time since the last poll, clear history. */ > > + if (shift < 64) > > Use a properly named define instead of a naked number. Like this? #define NUM_HISTORY_BITS (sizeof(u64) * BITS_PER_BYTE) if (shift < NUM_HISTORY_BITS) > > ... > > Thx. > > -- > Regards/Gruss, > Boris. -Tony
On Mon, Oct 23, 2023 at 11:14:04AM -0700, Tony Luck wrote: > I want to track whether each bank is in storm mode, or not. But there's > no indication when a CMCI is delivered which bank was the source. Code > just has to scan all the banks, and might find more than one with an > error. While no bank is in polling mode, there isn't a set time interval > between scanning banks. Well, if no bank is in *storm* polling mode - I presume you mean that when you say "polling mode" - then we have the default polling interval of machine_check_poll() of INITIAL_CHECK_INTERVAL, i.e., 5 mins, I'd say. > A scan is just triggered when a CMCI happens. So it's non-trivial to > compute a rate. Would require saving a timestamp for every logged > error. So what I'm trying to establish first is, what our entry vectors into the storm code are. 1. You can enter storm tracking when you poll normally. I.e., machine_check_poll() each 5 mins once. mce_track_storm() tracks history only for MCI_STATUS_VAL=1b CEs, which is what I was wondering. It is hidden a bit down in that function. 2. As a result of a CMCI interrupt. That will call machine_check_poll() too and go down the same path. So that flow should be called out in the commit message so that the big picture is clear - how we're doing that storm tracking. Now, what is protecting this against concurrent runs of machine_check_poll()? Imagine the timer fires and a CMCI happens at the same time and on the same core. > In a simple case there's just one bank responsible for a ton of CMCI. > No need for complexity here, the count of interrupts from that bank will > hit a threshold and a storm is declared. Right. > But more complex scenarois are possible. Other banks may trigger small > numbers of CMCI. Not enough to call it a storm. Or multiple banks may > be screaming together. > > By tracking both the hits and misses in each bank, I end up with a > bitmap history for the past 64 polls. 
If there are enough "1" bits in > that bitmap to meet the threshold, then declare a storm for that > bank. Yap, I only want to be crystal clear on the flow and the entry points. > I need to stare at this again to refresh my memory of what's going on > here. This code may need pulling apart into a routine that is used for > systems with no CMCI (or have CMCI disabled). Then the whole "divide the > poll interval by two" when you see an error and double the interval > when you don't see an error makes sense. > > For systems with CMCI ... I think just polling a one second interval > until the storm is over makes sense. Ok. > These are only used in threshold.c now. What's the point of them > being in internal.h. That's for defintiones shared by multiple > mcs/*.c files. Isn't it? But will move there if you still want this. Structs hidden in .c files looks weird but ok. > Ideally the new CPU would inherit the precise state of the previous > owner of this bank. But there's no easy way to track that as the bank > is abanoned by the CPU going offline, and there is a free-for-all with > remaining CPUs racing to claim ownership. It is known that this bank > was in storm mode (because the threshold in the CTL2 bank register is > set to CMCI_STORM_THRESHOLD). > > I went with "worst case" to make sure the new CPU didn't prematurely > declare an end to the storm. > > I'll add a comment in mce_inherit_storm() to explain this. Yap, exactly. > Like this? > > #define NUM_HISTORY_BITS (sizeof(u64) * BITS_PER_BYTE) > > if (shift < NUM_HISTORY_BITS) Yap. Thx.
On Tue, Nov 14, 2023 at 08:23:24PM +0100, Borislav Petkov wrote: > On Mon, Oct 23, 2023 at 11:14:04AM -0700, Tony Luck wrote: > > I want to track whether each bank is in storm mode, or not. But there's > > no indication when a CMCI is delivered which bank was the source. Code > > just has to scan all the banks, and might find more than one with an > > error. While no bank is in polling mode, there isn't a set time interval > > between scanning banks. > > Well, if no bank is in *storm* polling mode - I presume you mean that > when you say "polling mode" - then we have the default polling interval > of machine_check_poll() of INITIAL_CHECK_INTERVAL, i.e., 5 mins, I'd > say. Initial state is that no banks are in storm mode. Errors logged in any bank are signaled with a CMCI to all logical CPUs that share that bank. But the "mce_banks_owned" per-cpu bitmask means that only one of those CPUs will get past the test_bit() check at the top of the loop in machine_check_poll(): for (i = 0; i < this_cpu_read(mce_num_banks); i++) { if (!mce_banks[i].ctl || !test_bit(i, *b)) continue; Before any storm happens, machine_check_poll() may only be called once a month, or less, when errors occur. When a storm is detected for a bank, that bank (and any others in storm mode) will be checked once per second. > > A scan is just triggered when a CMCI happens. So it's non-trivial to > > compute a rate. Would require saving a timestamp for every logged > > error. > > So what I'm trying to establish first is, what our entry vectors into > the storm code are. > > 1. You can enter storm tracking when you poll normally. I.e., > machine_check_poll() each 5 mins once. For a bank that doesn't support CMCI, then polling is the only way to find errors. You are right, these will feed into the history tracker, but while at 5-minute interval will not be able to trigger a storm. 
Since that 5-minute interval is halved every time an error is found
consecutively, it is possible at the 1-second poll interval to fill out
enough bits to indicate a storm. I think I need to add some code to
handle that case as it makes no sense to mess with the CMCI threshold
in IA32_MCi_CTL2 for a bank that doesn't support CMCI. Probably will
just skip tracking for any such banks.

> mce_track_storm() tracks history only for MCI_STATUS_VAL=1b CEs, which
> is what I was wondering. It is hidden a bit down in that function.
>
> 2. As a result of a CMCI interrupt. That will call machine_check_poll()
> too and go down the same path.
>
> So that flow should be called out in the commit message so that the big
> picture is clear - how we're doing that storm tracking.

Yes. Those are both inputs to the storm tracker. I'll update the commit
message to describe that.

> Now, what is protecting this against concurrent runs of
> machine_check_poll()? Imagine the timer fires and a CMCI happens at the
> same time and on the same core.

Aren't interrupts disabled while running the code after the timer fires?

Whichever of the timer and the CMCI happens first will run. The second
to arrive will pend the interrupt and be handled when interrupts are
enabled as the first completes.

... rest of message trimmed, no open discussion items there.

-Tony
On Tue, Nov 14, 2023 at 02:04:46PM -0800, Tony Luck wrote: > Before any storm happens, machine_check_poll() may only be called once > a month, or less, when errors occur. Err: [ 317.825546] mce: mc_poll_banks_default: CPU2 irq ctxt level: 1 [ 317.825585] mce: mc_poll_banks_default: CPU0 irq ctxt level: 1 [ 317.825585] mce: mc_poll_banks_default: CPU1 irq ctxt level: 1 [ 317.825586] mce: mc_poll_banks_default: CPU3 irq ctxt level: 1 [ 317.825586] mce: mc_poll_banks_default: CPU4 irq ctxt level: 1 [ 317.825586] mce: mc_poll_banks_default: CPU5 irq ctxt level: 1 [ 629.121536] mce: mc_poll_banks_default: CPU1 irq ctxt level: 1 [ 629.121536] mce: mc_poll_banks_default: CPU4 irq ctxt level: 1 [ 629.121560] mce: mc_poll_banks_default: CPU2 irq ctxt level: 1 [ 629.121561] mce: mc_poll_banks_default: CPU0 irq ctxt level: 1 [ 629.121561] mce: mc_poll_banks_default: CPU5 irq ctxt level: 1 [ 629.121569] mce: mc_poll_banks_default: CPU3 irq ctxt level: 1 [ 940.417507] mce: mc_poll_banks_default: CPU2 irq ctxt level: 1 [ 940.417508] mce: mc_poll_banks_default: CPU3 irq ctxt level: 1 [ 940.417508] mce: mc_poll_banks_default: CPU1 irq ctxt level: 1 [ 940.417508] mce: mc_poll_banks_default: CPU4 irq ctxt level: 1 [ 940.417509] mce: mc_poll_banks_default: CPU5 irq ctxt level: 1 [ 940.417508] mce: mc_poll_banks_default: CPU0 irq ctxt level: 1 ... That's from my coffeelake test box. The irq context level thing says we're in softirq context when the polling happens. > When a storm is detected for a bank, that bank (and any others in storm > mode) will be checked once per second. Ok. > For a bank that doesn't support CMCI, then polling is the only way > to find errors. You are right, these will feed into the history > tracker, but while at 5-minute interval will not be able to trigger > a storm. Yes. But you need to call into the storm handling code somehow. So you do that from the polling code. 
And if the machine supports CMCI, you do the same - call the polling code which then does the storm check. > Since that 5-minute interval is halved every time error is > found consecutively, it is possible at the 1-second poll interval to > fill out enough bits to indicate a storm. I think I need to add some > code to handle that case as it makes no sense to mess with the CMCI > threshold in IA32_MCi_CTL2 for a bank that doesn't support CMCI. > Probably will just skip tracking for any such banks. Ok. > Aren't interrupts disabled while running the code after the timer fires? No, see above. > Whichever of the timer and the CMCI happens first will run. Second to > arrive will pend the interrupt and be handled when interrupts are > enabled as the first completes. So I still don't like the timer calling machine_check_poll() and cmci_mc_poll_banks() doing the same without any proper synchronization between the two. Yes, when you get a CMCI interrupt, you poll and do the call the storm code. Now what happens if the polling runs from softirq context and you get a CMCI interrupt at exactly the same time. I.e., is machine_check_poll() reentrant and audited properly? I hope I'm making more sense. Thx.
On Tue, Nov 21, 2023 at 12:54:48PM +0100, Borislav Petkov wrote: > On Tue, Nov 14, 2023 at 02:04:46PM -0800, Tony Luck wrote: > > Whichever of the timer and the CMCI happens first will run. Second to > > arrive will pend the interrupt and be handled when interrupts are > > enabled as the first completes. > > So I still don't like the timer calling machine_check_poll() and > cmci_mc_poll_banks() doing the same without any proper synchronization > between the two. But it isn't doing the same thing. The timer calls: machine_check_poll(0, this_cpu_ptr(&mce_poll_banks)); and cmci_mc_poll_banks() calls: machine_check_poll(0, this_cpu_ptr(&mce_poll_banks)); A bank is either in the bitmap of banks to poll from the timer, or in one of the per-cpu bitmaps of banks "owned" by that CPU to be checked when a CMCI occurs. But it can't be in both. > Yes, when you get a CMCI interrupt, you poll and do the call the storm > code. Now what happens if the polling runs from softirq context and you > get a CMCI interrupt at exactly the same time. I.e., is > machine_check_poll() reentrant and audited properly? So nothing bad happens. If Linux was polling some set of banks from the timer and is interrupted by CMCI, the interrupt will check some disjoint set of banks. All the history tracking code is done per-bank, so there is no overlap. > I hope I'm making more sense. Yes. Totally making sense. I was under the mistaken impression that the mce timers used TIMER_IRQSAFE and the nested CMCI while processing a timed poll couldn't happen. So I learned something here too. I'll think of some comment to add to the history tracking code to summarize this thread. -Tony
On Mon, Nov 27, 2023 at 11:50:26AM -0800, Tony Luck wrote: > On Tue, Nov 21, 2023 at 12:54:48PM +0100, Borislav Petkov wrote: > > On Tue, Nov 14, 2023 at 02:04:46PM -0800, Tony Luck wrote: > But it isn't doing the same thing. The timer calls: > > machine_check_poll(0, this_cpu_ptr(&mce_poll_banks)); > > and cmci_mc_poll_banks() calls: > > machine_check_poll(0, this_cpu_ptr(&mce_poll_banks)); Bah ... I've cut & pasted the same thing ... but I think there are separate bit maps ... maybe I'm wrong. Will go back and look again. -Tony
On Mon, Nov 27, 2023 at 12:14:28PM -0800, Tony Luck wrote: > On Mon, Nov 27, 2023 at 11:50:26AM -0800, Tony Luck wrote: > > On Tue, Nov 21, 2023 at 12:54:48PM +0100, Borislav Petkov wrote: > > > On Tue, Nov 14, 2023 at 02:04:46PM -0800, Tony Luck wrote: > > But it isn't doing the same thing. The timer calls: > > > > machine_check_poll(0, this_cpu_ptr(&mce_poll_banks)); > > > > and cmci_mc_poll_banks() calls: > > > > machine_check_poll(0, this_cpu_ptr(&mce_poll_banks)); machine_check_poll(0, this_cpu_ptr(&mce_banks_owned)); > > Bah ... I've cut & pasted the same thing ... but I think there > are separate bit maps ... maybe I'm wrong. Will go back and > look again. > > -Tony
On 11/27/2023 7:42 PM, Tony Luck wrote: > On Mon, Nov 27, 2023 at 12:14:28PM -0800, Tony Luck wrote: >> On Mon, Nov 27, 2023 at 11:50:26AM -0800, Tony Luck wrote: >>> On Tue, Nov 21, 2023 at 12:54:48PM +0100, Borislav Petkov wrote: >>>> On Tue, Nov 14, 2023 at 02:04:46PM -0800, Tony Luck wrote: >>> But it isn't doing the same thing. The timer calls: >>> >>> machine_check_poll(0, this_cpu_ptr(&mce_poll_banks)); >>> >>> and cmci_mc_poll_banks() calls: >>> >>> machine_check_poll(0, this_cpu_ptr(&mce_poll_banks)); > > machine_check_poll(0, this_cpu_ptr(&mce_banks_owned)); > >> >> Bah ... I've cut & pasted the same thing ... but I think there >> are separate bit maps ... maybe I'm wrong. Will go back and >> look again. >> This mutually exclusive behavior is not enforced on AMD systems. A bank can be polled and signaled using hardware interrupts. I've been thinking to change this in order to save polling cycles. Now it seems there's another reason. :) Thanks, Yazen
On Mon, Nov 27, 2023 at 04:42:02PM -0800, Tony Luck wrote: > On Mon, Nov 27, 2023 at 12:14:28PM -0800, Tony Luck wrote: > > On Mon, Nov 27, 2023 at 11:50:26AM -0800, Tony Luck wrote: > > > On Tue, Nov 21, 2023 at 12:54:48PM +0100, Borislav Petkov wrote: > > > > On Tue, Nov 14, 2023 at 02:04:46PM -0800, Tony Luck wrote: > > > But it isn't doing the same thing. The timer calls: > > > > > > machine_check_poll(0, this_cpu_ptr(&mce_poll_banks)); > > > > > > and cmci_mc_poll_banks() calls: > > > > > > machine_check_poll(0, this_cpu_ptr(&mce_poll_banks)); > > machine_check_poll(0, this_cpu_ptr(&mce_banks_owned)); Hmm, so I applied your v10 and this call with mce_banks_owned is done in cmci_recheck() only. Which is on some init path. The thresholding interrupt calls it too. The timer ends up calling mc_poll_banks_default() which does machine_check_poll(0, this_cpu_ptr(&mce_poll_banks)); I presume we don't do: if (!cmci_supported(&banks)) { mc_poll_banks = cmci_mc_poll_banks; return; } usually on Intel. And even if we did, cmci_mc_poll_banks() calls machine_check_poll(0, this_cpu_ptr(&mce_poll_banks)); too. So regardless what machine you have, you do call the mc_poll_banks pointer which in both cases does machine_check_poll(0, this_cpu_ptr(&mce_poll_banks)); The *thresholding* interrupt does machine_check_poll(0, this_cpu_ptr(&mce_banks_owned)); and you're saying mce_poll_banks and mce_banks_owned are disjoint. That's what you mean, right? Because if so, yes, that makes sense. If the sets of MCA banks polled and handled in the thresholding interrupt are disjoint, we should be ok. Thx.
> and you're saying
>
> 	mce_poll_banks and mce_banks_owned
>
> are disjoint.
>
> That's what you mean, right?

Yes. A bank is in one bitmap or the other, but not both.

> Because if so, yes, that makes sense. If the sets of MCA banks polled
> and handled in the thresholding interrupt are disjoint, we should be ok.

Yes.

-Tony
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 616732ec16f7..4aa957442c8d 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -54,7 +54,45 @@ static inline void intel_clear_lmce(void) { }
 static inline bool intel_filter_mce(struct mce *m) { return false; }
 #endif
 
-void mce_timer_kick(unsigned long interval);
+void mce_timer_kick(bool storm);
+
+#ifdef CONFIG_X86_MCE_THRESHOLD
+void cmci_storm_begin(unsigned int bank);
+void cmci_storm_end(unsigned int bank);
+void mce_track_storm(struct mce *mce);
+void mce_inherit_storm(unsigned int bank);
+bool mce_is_storm(void);
+void mce_set_storm(bool storm);
+#else
+static inline void cmci_storm_begin(unsigned int bank) {}
+static inline void cmci_storm_end(unsigned int bank) {}
+static inline void mce_track_storm(struct mce *mce) {}
+static inline void mce_inherit_storm(unsigned int bank) {}
+static inline bool mce_is_storm(void) { return false; }
+static inline void mce_set_storm(bool storm) {}
+#endif
+
+/*
+ * history:	bitmask tracking whether errors were seen or not seen in
+ *		the most recent polls of a bank.
+ * timestamp:	last time (in jiffies) that the bank was polled
+ * storm:	Is this bank in storm mode?
+ */
+struct storm_bank {
+	u64 history;
+	u64 timestamp;
+	bool storm;
+};
+
+/* How many errors within the history buffer mark the start of a storm. */
+#define STORM_BEGIN_THRESHOLD	5
+
+/*
+ * How many polls of machine check bank without an error before declaring
+ * the storm is over. Since it is tracked by the bitmasks in the history
+ * field of struct storm_bank the mask is 30 bits [0 ... 29].
+ */
+#define STORM_END_POLL_THRESHOLD	29
 
 #ifdef CONFIG_ACPI_APEI
 int apei_write_mce(struct mce *m);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index f6e87443b37a..7c931f0c9251 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -680,6 +680,8 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 		barrier();
 		m.status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS));
 
+		mce_track_storm(&m);
+
 		/* If this entry is not valid, ignore it */
 		if (!(m.status & MCI_STATUS_VAL))
 			continue;
@@ -1652,22 +1654,29 @@ static void mce_timer_fn(struct timer_list *t)
 	else
 		iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));
 
-	__this_cpu_write(mce_next_interval, iv);
-	__start_timer(t, iv);
+	if (mce_is_storm()) {
+		__start_timer(t, HZ);
+	} else {
+		__this_cpu_write(mce_next_interval, iv);
+		__start_timer(t, iv);
+	}
 }
 
 /*
- * Ensure that the timer is firing in @interval from now.
+ * When a storm starts on any bank on this CPU, switch to polling
+ * once per second. When the storm ends, revert to the default
+ * polling interval.
 */
-void mce_timer_kick(unsigned long interval)
+void mce_timer_kick(bool storm)
 {
 	struct timer_list *t = this_cpu_ptr(&mce_timer);
-	unsigned long iv = __this_cpu_read(mce_next_interval);
 
-	__start_timer(t, interval);
+	mce_set_storm(storm);
 
-	if (interval < iv)
-		__this_cpu_write(mce_next_interval, interval);
+	if (storm)
+		__start_timer(t, HZ);
+	else
+		__this_cpu_write(mce_next_interval, check_interval * HZ);
 }
 
 /* Must not be called in IRQ context where del_timer_sync() can deadlock */
diff --git a/arch/x86/kernel/cpu/mce/threshold.c b/arch/x86/kernel/cpu/mce/threshold.c
index ef4e7bb5fd88..ecdf13f1bb7d 100644
--- a/arch/x86/kernel/cpu/mce/threshold.c
+++ b/arch/x86/kernel/cpu/mce/threshold.c
@@ -29,3 +29,115 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_threshold)
 	trace_threshold_apic_exit(THRESHOLD_APIC_VECTOR);
 	apic_eoi();
 }
+
+/*
+ * banks:		per-cpu, per-bank details
+ * stormy_bank_count:	count of MC banks in storm state
+ * poll_mode:		CPU is in poll mode
+ */
+struct mca_storm_desc {
+	struct storm_bank	banks[MAX_NR_BANKS];
+	u8			stormy_bank_count;
+	bool			poll_mode;
+};
+
+static DEFINE_PER_CPU(struct mca_storm_desc, storm_desc);
+
+void mce_inherit_storm(unsigned int bank)
+{
+	struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+	storm->banks[bank].history = ~0ull;
+	storm->banks[bank].timestamp = jiffies;
+}
+
+bool mce_is_storm(void)
+{
+	return __this_cpu_read(storm_desc.poll_mode);
+}
+
+void mce_set_storm(bool storm)
+{
+	__this_cpu_write(storm_desc.poll_mode, storm);
+}
+
+static void mce_handle_storm(unsigned int bank, bool on)
+{
+	switch (boot_cpu_data.x86_vendor) {
+	}
+}
+
+void cmci_storm_begin(unsigned int bank)
+{
+	struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+	__set_bit(bank, this_cpu_ptr(mce_poll_banks));
+	storm->banks[bank].storm = true;
+
+	/*
+	 * If this is the first bank on this CPU to enter storm mode
+	 * start polling.
+	 */
+	if (++storm->stormy_bank_count == 1)
+		mce_timer_kick(true);
+}
+
+void cmci_storm_end(unsigned int bank)
+{
+	struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+
+	__clear_bit(bank, this_cpu_ptr(mce_poll_banks));
+	storm->banks[bank].history = 0;
+	storm->banks[bank].storm = false;
+
+	/* If no banks left in storm mode, stop polling. */
+	if (!this_cpu_dec_return(storm_desc.stormy_bank_count))
+		mce_timer_kick(false);
+}
+
+void mce_track_storm(struct mce *mce)
+{
+	struct mca_storm_desc *storm = this_cpu_ptr(&storm_desc);
+	unsigned long now = jiffies, delta;
+	unsigned int shift = 1;
+	u64 history = 0;
+
+	/*
+	 * When a bank is in storm mode it is polled once per second and
+	 * the history mask will record about the last minute of poll results.
+	 * If it is not in storm mode, then the bank is only checked when
+	 * there is a CMCI interrupt. Check how long it has been since
+	 * this bank was last checked, and adjust the amount of "shift"
+	 * to apply to history.
+	 */
+	if (!storm->banks[mce->bank].storm) {
+		delta = now - storm->banks[mce->bank].timestamp;
+		shift = (delta + HZ) / HZ;
+	}
+
+	/* If it has been a long time since the last poll, clear history. */
+	if (shift < 64)
+		history = storm->banks[mce->bank].history << shift;
+
+	storm->banks[mce->bank].timestamp = now;
+
+	/* History keeps track of corrected errors. VAL=1 && UC=0 */
+	if ((mce->status & MCI_STATUS_VAL) && mce_is_correctable(mce))
+		history |= 1;
+
+	storm->banks[mce->bank].history = history;
+
+	if (storm->banks[mce->bank].storm) {
+		if (history & GENMASK_ULL(STORM_END_POLL_THRESHOLD, 0))
+			return;
+		printk_deferred(KERN_NOTICE "CPU%d BANK%d CMCI storm subsided\n", smp_processor_id(), mce->bank);
+		mce_handle_storm(mce->bank, false);
+		cmci_storm_end(mce->bank);
+	} else {
+		if (hweight64(history) < STORM_BEGIN_THRESHOLD)
+			return;
+		printk_deferred(KERN_NOTICE "CPU%d BANK%d CMCI storm detected\n", smp_processor_id(), mce->bank);
+		mce_handle_storm(mce->bank, true);
+		cmci_storm_begin(mce->bank);
+	}
+}
This is the core functionality to track CMCI storms at the
machine check bank granularity. Subsequent patches will add
the vendor specific hooks to supply input to the storm
detection and take actions on the start/end of a storm.

Maintain a bitmap history for each bank showing whether the bank
logged a corrected error or not each time it is polled. In normal
operation the interval between polls of a bank determines how far
to shift the history. Each bit of history corresponds to about one
second, so the 64-bit width covers about a minute.

When a storm is observed a CPU vendor specific action is taken to reduce
or stop CMCI from the bank that is the source of the storm. The bank
is added to the bitmap of banks for this CPU to poll. The polling rate
is increased to once per second. During a storm each bit in the history
indicates the status of the bank each time it is polled. Thus the
history covers just over a minute.

Declare a storm for that bank if the number of corrected errors seen
in that history is above some threshold (defined as 5 in this series,
could be tuned later if there is data to suggest a better value).

A storm on a bank ends if enough consecutive polls of the bank show
no corrected errors (defined as 30, may also change). That calls the
CPU vendor specific function to revert to normal operational mode,
and changes the polling rate back to the default.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
v8: Updated with review comments from Yazen.
    Link: https://lore.kernel.org/r/c76723df-f2f1-4888-9e05-61917145503c@amd.com
v9: Fix lkp randconfig build issue when neither INTEL nor AMD configured
---
 arch/x86/kernel/cpu/mce/internal.h  |  40 +++++++++-
 arch/x86/kernel/cpu/mce/core.c      |  25 +++++--
 arch/x86/kernel/cpu/mce/threshold.c | 112 ++++++++++++++++++++++++++++
 3 files changed, 168 insertions(+), 9 deletions(-)