Message ID | cover.1697816692.git.raghavendra.kt@amd.com (mailing list archive) |
---|---|
State | New |
On 10/20/2023 9:27 PM, Raghavendra K T wrote:
> NUMA balancing code that updates PTEs by allowing unconditional scan
> based on the value of processes' mm numa_scan_seq is not perfect.
>
> More description is in patch1.
>
> Have used the below patch to identify the corner case.
>
> Detailed Result: (Only part of the result is updated
> in patch1 to save space in commit log)
>
> Detailed Result:
>
> SUT: AMD EPYC Milan with 2 NUMA nodes 256 cpus.
>
> Base kernel: upstream 6.6-rc6 (dd72f9c7e512) with Mel's patch series
> from tip/sched/core [1] applied.
>
> Summary: Some benchmarks improve. There is an increase in system
> time due to additional scanning. But elapsed time shows gain.
>
> However there is also some overhead seen for benchmarks like NUMA01.
>
> kernbench
> ==========
>                                    base                 patched
> Amean     user-128    13799.58 (   0.00%)    13789.86 *   0.07%*
> Amean     syst-128     3280.80 (   0.00%)     3249.67 *   0.95%*
> Amean     elsp-128      165.09 (   0.00%)      164.78 *   0.19%*
>
> Duration User       41404.28    41375.08
> Duration System      9862.22     9768.48
> Duration Elapsed      519.87      518.72
>
> Ops NUMA PTE updates       1041416.00      831536.00
> Ops NUMA hint faults        263296.00      220966.00
> Ops NUMA pages migrated     258021.00      212769.00
> Ops AutoNUMA cost             1328.67        1114.69
>
> autonumabench
>
> NUMA01_THREADLOCAL
> ==================
> Amean  syst-NUMA01_THREADLOCAL    10.65 (   0.00%)    26.47 * -148.59%*
> Amean  elsp-NUMA01_THREADLOCAL    81.79 (   0.00%)    67.74 *   17.18%*
>
> Duration User       54832.73    47379.67
> Duration System        75.00      185.75
> Duration Elapsed      576.72      476.09
>
> Ops NUMA PTE updates        394429.00    11121044.00
> Ops NUMA hint faults          1001.00     8906404.00
> Ops NUMA pages migrated        288.00     2998694.00
> Ops AutoNUMA cost                7.77       44666.84
>
> NUMA01
> =====
> Amean  syst-NUMA01     31.97 (   0.00%)     52.95 *  -65.62%*
> Amean  elsp-NUMA01    143.16 (   0.00%)    150.81 *   -5.34%*
>
> Duration User       84839.49    91342.19
> Duration System       224.26      371.12
> Duration Elapsed     1005.64     1059.01
>
> Ops NUMA PTE updates      33929508.00    50116313.00
> Ops NUMA hint faults      34993820.00    52895783.00
> Ops NUMA pages migrated    5456115.00     7441228.00
> Ops AutoNUMA cost           175310.27      264971.11
>
> NUMA02
> =========
> Amean  syst-NUMA02    0.86 (   0.00%)    0.86 *  -0.50%*
> Amean  elsp-NUMA02    3.99 (   0.00%)    3.82 *   4.40%*
>
> Duration User        1186.06     1092.07
> Duration System         6.44        6.47
> Duration Elapsed       31.28       30.30
>
> Ops NUMA PTE updates       776.00      731.00
> Ops NUMA hint faults       527.00      490.00
> Ops NUMA pages migrated    183.00      153.00
> Ops AutoNUMA cost            2.64        2.46
>
> Link: https://lore.kernel.org/linux-mm/ZSXF3AFZgIld1meX@gmail.com/T/
>

Forgot to add skip_vma_count trace results:

autonumabench: numa01_THREAD_LOCAL 3 iterations

base:
inaccessible:13133
pid_inactive:15807
scan_delay:471
seq_completed:50
shared_ro:6983
unsuitable:3917

patched:
inaccessible:4727
pid_inactive:5119
scan_delay:455
seq_completed:7
shared_ro:2551
unsuitable:5402

> Raghavendra K T (1):
>   sched/numa: Fix mm numa_scan_seq based unconditional scan
>
>  include/linux/mm_types.h | 3 +++
>  kernel/sched/fair.c      | 4 +++-
>  2 files changed, 6 insertions(+), 1 deletion(-)
>
> ---8<---
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 010ba1b7cb0e..a4870b01c8a1 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -10,6 +10,30 @@
>  #include <linux/tracepoint.h>
>  #include <linux/binfmts.h>
>
> +TRACE_EVENT(sched_vma_start_seq,
> +
> +	TP_PROTO(struct task_struct *t, struct vm_area_struct *vma, int start_seq),
> +
> +	TP_ARGS(t, vma, start_seq),
> +
> +	TP_STRUCT__entry(
> +		__array(	char,	comm,	TASK_COMM_LEN	)
> +		__field(	pid_t,	pid			)
> +		__field(	void *,	vma			)
> +		__field(	int,	start_seq		)
> +	),
> +
> +	TP_fast_assign(
> +		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
> +		__entry->pid		= t->pid;
> +		__entry->vma		= vma;
> +		__entry->start_seq	= start_seq;
> +	),
> +
> +	TP_printk("comm=%s pid=%d vma = %px start_seq=%d", __entry->comm, __entry->pid, __entry->vma,
> +		  __entry->start_seq)
> +);
> +
>  /*
>   * Tracepoint for calling kthread_stop, performed to end a kthread:
>   */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c8af3a7ccba7..e0c16ea8470b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3335,6 +3335,7 @@ static void task_numa_work(struct callback_head *work)
>  			continue;
>
>  		vma->numab_state->start_scan_seq = mm->numa_scan_seq;
> +		trace_sched_vma_start_seq(p, vma, mm->numa_scan_seq);
>
>  		vma->numab_state->next_scan = now +
>  			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
>
>
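The percentage column in the Amean rows quoted above appears consistent with a relative gain computed for a lower-is-better metric, that is (base - patched) / base. The sketch below is only an illustration of that reading; the helper name `gain_percent` is hypothetical and is not taken from the benchmark tooling, and small differences from the reported figures can arise because the table shows rounded means.

```c
#include <stdio.h>

/*
 * Relative gain, in percent, for a lower-is-better metric such as
 * system or elapsed time. A positive result means the patched kernel
 * took less time than the base kernel; a negative result is a regression.
 * Hypothetical helper, assumed to mirror the sign convention of the
 * percentages shown in the Amean rows.
 */
double gain_percent(double base, double patched)
{
	return (base - patched) / base * 100.0;
}

/* Print one comparison row in a layout similar to the quoted tables. */
void print_gain(const char *name, double base, double patched)
{
	printf("%-28s %10.2f %10.2f %8.2f%%\n",
	       name, base, patched, gain_percent(base, patched));
}
```

For example, `print_gain("elsp-NUMA01", 143.16, 150.81)` reports a loss of roughly 5.34%, and `print_gain("elsp-NUMA01_THREADLOCAL", 81.79, 67.74)` a gain of roughly 17.18%, matching the quoted rows.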
On 10/20/2023 9:27 PM, Raghavendra K T wrote:
> NUMA balancing code that updates PTEs by allowing unconditional scan
> based on the value of processes' mm numa_scan_seq is not perfect.
>
> More description is in patch1.
>
> Have used the below patch to identify the corner case.
>
> Detailed Result: (Only part of the result is updated
> in patch1 to save space in commit log)
>

Gentle ping to check if there are any concerns / comments on the patch :)

Thanks and Regards
- Raghu
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 010ba1b7cb0e..a4870b01c8a1 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -10,6 +10,30 @@
 #include <linux/tracepoint.h>
 #include <linux/binfmts.h>
 
+TRACE_EVENT(sched_vma_start_seq,
+
+	TP_PROTO(struct task_struct *t, struct vm_area_struct *vma, int start_seq),
+
+	TP_ARGS(t, vma, start_seq),
+
+	TP_STRUCT__entry(
+		__array(	char,	comm,	TASK_COMM_LEN	)
+		__field(	pid_t,	pid			)
+		__field(	void *,	vma			)
+		__field(	int,	start_seq		)
+	),
+
+	TP_fast_assign(
+		memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
+		__entry->pid		= t->pid;
+		__entry->vma		= vma;
+		__entry->start_seq	= start_seq;
+	),
+
+	TP_printk("comm=%s pid=%d vma = %px start_seq=%d", __entry->comm, __entry->pid, __entry->vma,
+		  __entry->start_seq)
+);
+
 /*
  * Tracepoint for calling kthread_stop, performed to end a kthread:
  */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c8af3a7ccba7..e0c16ea8470b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3335,6 +3335,7 @@ static void task_numa_work(struct callback_head *work)
 			continue;
 
 		vma->numab_state->start_scan_seq = mm->numa_scan_seq;
+		trace_sched_vma_start_seq(p, vma, mm->numa_scan_seq);
 
 		vma->numab_state->next_scan = now +
 			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
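
For post-processing the events emitted by this debug tracepoint, a userspace helper along the following lines could pull the fields back out of a text trace line. This is a minimal sketch, not part of the patch: the struct and function names are hypothetical, the only thing taken from the source is the `TP_printk` format string, and it assumes `comm` contains no spaces and that the line carries the payload somewhere after an arbitrary prefix.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace mirror of the fields printed by the tracepoint. */
struct vma_start_seq_event {
	char comm[16];			/* TASK_COMM_LEN is 16 in the kernel */
	int pid;
	unsigned long long vma;		/* raw pointer value as printed by %px */
	int start_seq;
};

/*
 * Parse one text line containing the payload
 *   "comm=%s pid=%d vma = %px start_seq=%d"
 * Anything before "comm=" (task name, CPU, timestamp, event name) is skipped.
 * Returns 0 on success, -1 if the line does not contain a full event.
 */
int parse_vma_start_seq(const char *line, struct vma_start_seq_event *ev)
{
	const char *payload = strstr(line, "comm=");
	int n;

	if (!payload)
		return -1;

	/* %15s bounds the copy to the comm buffer (15 chars plus NUL). */
	n = sscanf(payload, "comm=%15s pid=%d vma = %llx start_seq=%d",
		   ev->comm, &ev->pid, &ev->vma, &ev->start_seq);

	return n == 4 ? 0 : -1;
}
```

Grouping the parsed events by `vma` and comparing `start_seq` across them is one way to see which VMAs were first scanned at which value of `numa_scan_seq`.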