Message ID | 20231019033323.54147-4-yangyicong@huawei.com (mailing list archive)
---|---
State | New, archived
Series | sched/fair: Scan cluster before scanning LLC in wake-up path
On Thu, 19 Oct 2023 at 05:36, Yicong Yang <yangyicong@huawei.com> wrote:
>
> From: Yicong Yang <yangyicong@hisilicon.com>
>
> Chen Yu reports a hackbench regression of cluster wakeup when
> hackbench threads equal to the CPU number [1]. Analysis shows
> it's because we wake up more on the target CPU even if the
> prev_cpu is a good wakeup candidate and leads to the decrease
> of the CPU utilization.
>
> Generally if the task's prev_cpu is idle we'll wake up the task
> on it without scanning. On cluster machines we'll try to wake up
> the task in the same cluster of the target for better cache
> affinity, so if the prev_cpu is idle but not sharing the same
> cluster with the target we'll still try to find an idle CPU within
> the cluster. This will improve the performance at low loads on
> cluster machines. But in the issue above, if the prev_cpu is idle
> but not in the cluster with the target CPU, we'll try to scan an
> idle one in the cluster. But since the system is busy, we're
> likely to fail the scanning and use target instead, even if
> the prev_cpu is idle. Then leads to the regression.
>
> This patch solves this in 2 steps:
> o record the prev_cpu/recent_used_cpu if they're good wakeup
>   candidates but not sharing the cluster with the target.
> o on scanning failure use the prev_cpu/recent_used_cpu if
>   they're recorded as idle
>
> [1] https://lore.kernel.org/all/ZGzDLuVaHR1PAYDt@chenyu5-mobl1/
>
> Reported-by: Chen Yu <yu.c.chen@intel.com>
> Closes: https://lore.kernel.org/all/ZGsLy83wPIpamy6x@chenyu5-mobl1/
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  kernel/sched/fair.c | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 02d842df5294..d508d1999ecc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7346,7 +7346,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>         bool has_idle_core = false;
>         struct sched_domain *sd;
>         unsigned long task_util, util_min, util_max;
> -       int i, recent_used_cpu;
> +       int i, recent_used_cpu, prev_aff = -1;
>
>         /*
>          * On asymmetric system, update task utilization because we will check
> @@ -7379,6 +7379,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>
>                 if (cpus_share_resources(prev, target))
>                         return prev;
> +
> +               prev_aff = prev;
>         }
>
>         /*
> @@ -7411,6 +7413,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>
>                 if (cpus_share_resources(recent_used_cpu, target))
>                         return recent_used_cpu;
> +       } else {
> +               recent_used_cpu = -1;
>         }
>
>         /*
> @@ -7451,6 +7455,17 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>         if ((unsigned)i < nr_cpumask_bits)
>                 return i;
>
> +       /*
> +        * For cluster machines which have lower sharing cache like L2 or
> +        * LLC Tag, we tend to find an idle CPU in the target's cluster
> +        * first. But prev_cpu or recent_used_cpu may also be a good candidate,
> +        * use them if possible when no idle CPU found in select_idle_cpu().
> +        */
> +       if ((unsigned int)prev_aff < nr_cpumask_bits)
> +               return prev_aff;
> +       if ((unsigned int)recent_used_cpu < nr_cpumask_bits)
> +               return recent_used_cpu;
> +
>         return target;
>  }
>
> --
> 2.24.0
>
On 2023-10-19 at 11:33:23 +0800, Yicong Yang wrote:
> From: Yicong Yang <yangyicong@hisilicon.com>
>
> Chen Yu reports a hackbench regression of cluster wakeup when
> hackbench threads equal to the CPU number [1]. Analysis shows
> it's because we wake up more on the target CPU even if the
> prev_cpu is a good wakeup candidate and leads to the decrease
> of the CPU utilization.
>
> Generally if the task's prev_cpu is idle we'll wake up the task
> on it without scanning. On cluster machines we'll try to wake up
> the task in the same cluster of the target for better cache
> affinity, so if the prev_cpu is idle but not sharing the same
> cluster with the target we'll still try to find an idle CPU within
> the cluster. This will improve the performance at low loads on
> cluster machines. But in the issue above, if the prev_cpu is idle
> but not in the cluster with the target CPU, we'll try to scan an
> idle one in the cluster. But since the system is busy, we're
> likely to fail the scanning and use target instead, even if
> the prev_cpu is idle. Then leads to the regression.
>
> This patch solves this in 2 steps:
> o record the prev_cpu/recent_used_cpu if they're good wakeup
>   candidates but not sharing the cluster with the target.
> o on scanning failure use the prev_cpu/recent_used_cpu if
>   they're recorded as idle
>
> [1] https://lore.kernel.org/all/ZGzDLuVaHR1PAYDt@chenyu5-mobl1/
>
> Reported-by: Chen Yu <yu.c.chen@intel.com>
> Closes: https://lore.kernel.org/all/ZGsLy83wPIpamy6x@chenyu5-mobl1/
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
>

Tested on 24 CPUs Jacobsville machine, 4 CPUs in one cluster sharing L2
Cache. The baseline is sched/core on top of commit a36e5741bdc5 ("sched:
Fix stop_one_cpu_nowait() vs hotplug"), and compared with the whole patch
set applied.
I did not see any regression but improvement on hackbench, please feel
free to add:

Tested-and-reviewed-by: Chen Yu <yu.c.chen@intel.com>

hackbench
=========
case                    load            baseline(std%)  compare%( std%)
process-pipe            1-groups        1.00 (  0.26)   +6.02 (  1.53)
process-pipe            2-groups        1.00 (  1.03)   +1.97 (  0.70)
process-pipe            4-groups        1.00 (  0.26)   +1.80 (  3.27)
process-sockets         1-groups        1.00 (  0.29)   +1.63 (  0.86)
process-sockets         2-groups        1.00 (  1.17)   +2.59 (  0.35)
process-sockets         4-groups        1.00 (  1.07)   +3.86 (  0.51)
threads-pipe            1-groups        1.00 (  0.79)   +8.17 (  0.48)
threads-pipe            2-groups        1.00 (  0.65)   +6.34 (  0.23)
threads-pipe            4-groups        1.00 (  0.38)   +4.61 (  1.04)
threads-sockets         1-groups        1.00 (  0.73)   +0.80 (  0.35)
threads-sockets         2-groups        1.00 (  1.09)   +2.81 (  1.18)
threads-sockets         4-groups        1.00 (  0.67)   +2.30 (  0.20)

netperf
=======
case                    load            baseline(std%)  compare%( std%)
TCP_RR                  6-threads       1.00 (  0.48)   +3.97 (  0.50)
TCP_RR                  12-threads      1.00 (  0.11)   +3.83 (  0.15)
TCP_RR                  18-threads      1.00 (  0.18)   +7.53 (  0.18)
TCP_RR                  24-threads      1.00 (  0.34)   +2.40 (  0.77)
TCP_RR                  30-threads      1.00 ( 10.39)   +2.22 ( 11.51)
TCP_RR                  36-threads      1.00 ( 10.87)   +2.06 ( 16.71)
TCP_RR                  42-threads      1.00 ( 14.04)   +2.10 ( 12.86)
TCP_RR                  48-threads      1.00 (  5.89)   +2.15 (  5.54)
UDP_RR                  6-threads       1.00 (  0.20)   +2.99 (  0.55)
UDP_RR                  12-threads      1.00 (  0.18)   +3.65 (  0.27)
UDP_RR                  18-threads      1.00 (  0.34)   +6.62 (  0.23)
UDP_RR                  24-threads      1.00 (  0.60)   -1.73 ( 12.54)
UDP_RR                  30-threads      1.00 (  9.70)   -0.62 ( 14.34)
UDP_RR                  36-threads      1.00 ( 11.80)   -0.05 ( 12.27)
UDP_RR                  42-threads      1.00 ( 15.35)   -0.05 ( 12.26)
UDP_RR                  48-threads      1.00 (  5.58)   -0.12 (  5.73)

tbench
======
case                    load            baseline(std%)  compare%( std%)
loopback                6-threads       1.00 (  0.29)   +2.51 (  0.24)
loopback                12-threads      1.00 (  0.08)   +2.90 (  0.47)
loopback                18-threads      1.00 (  0.06)   +6.85 (  0.07)
loopback                24-threads      1.00 (  0.20)   +1.85 (  0.14)
loopback                30-threads      1.00 (  0.15)   +1.37 (  0.07)
loopback                36-threads      1.00 (  0.12)   +1.34 (  0.07)
loopback                42-threads      1.00 (  0.09)   +0.91 (  0.04)
loopback                48-threads      1.00 (  0.11)   +0.88 (  0.05)

schbench
========
case                    load            baseline(std%)  compare%( std%)
normal                  1-mthreads      1.00 (  2.67)   -1.89 (  0.00)
normal                  2-mthreads      1.00 (  0.00)   +0.00 (  0.00)
normal                  4-mthreads      1.00 (  8.08)  +12.86 (  2.32)

thanks,
Chenyu
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02d842df5294..d508d1999ecc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7346,7 +7346,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
        bool has_idle_core = false;
        struct sched_domain *sd;
        unsigned long task_util, util_min, util_max;
-       int i, recent_used_cpu;
+       int i, recent_used_cpu, prev_aff = -1;

        /*
         * On asymmetric system, update task utilization because we will check
@@ -7379,6 +7379,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)

                if (cpus_share_resources(prev, target))
                        return prev;
+
+               prev_aff = prev;
        }

        /*
@@ -7411,6 +7413,8 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)

                if (cpus_share_resources(recent_used_cpu, target))
                        return recent_used_cpu;
+       } else {
+               recent_used_cpu = -1;
        }

        /*
@@ -7451,6 +7455,17 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
        if ((unsigned)i < nr_cpumask_bits)
                return i;

+       /*
+        * For cluster machines which have lower sharing cache like L2 or
+        * LLC Tag, we tend to find an idle CPU in the target's cluster
+        * first. But prev_cpu or recent_used_cpu may also be a good candidate,
+        * use them if possible when no idle CPU found in select_idle_cpu().
+        */
+       if ((unsigned int)prev_aff < nr_cpumask_bits)
+               return prev_aff;
+       if ((unsigned int)recent_used_cpu < nr_cpumask_bits)
+               return recent_used_cpu;
+
        return target;
 }
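
To make the resulting control flow easier to follow, here is a condensed,
self-contained C sketch of the selection order select_idle_sibling() ends up
with after this change. It is only an illustration, not the kernel code: the
toy topology, the shares_cluster() and scan_for_idle() helpers and the
pick_wakeup_cpu() function are hypothetical stand-ins for
cpus_share_resources(), select_idle_cpu() and the real routine, and the scan
is deliberately simplified so that it exhibits the failure case the patch is
about.

/*
 * Illustrative sketch only -- NOT the kernel implementation.  It condenses
 * the selection order of select_idle_sibling() after this patch, on a toy
 * topology with stand-in helpers.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 8

/* Toy topology: CPUs 0-3 form cluster 0, CPUs 4-7 form cluster 1; all eight
 * CPUs share one LLC.  idle[] marks which CPUs are currently idle. */
static const int cluster_of[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };
static bool idle[NR_CPUS];

/* Stand-in for cpus_share_resources(): same cluster, e.g. shared L2. */
static bool shares_cluster(int a, int b)
{
        return cluster_of[a] == cluster_of[b];
}

/*
 * Stand-in for select_idle_cpu().  The real scan covers the whole LLC but is
 * depth-limited under load, so it can miss idle CPUs outside the target's
 * cluster; that failure mode is modelled crudely here by scanning only the
 * target's cluster.
 */
static int scan_for_idle(int target)
{
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                if (idle[cpu] && shares_cluster(cpu, target))
                        return cpu;
        return -1;
}

/* Condensed selection order after the patch. */
static int pick_wakeup_cpu(int prev, int recent, int target)
{
        int prev_aff = -1, recent_aff = -1, i;

        if (idle[target])
                return target;

        if (prev != target && idle[prev]) {
                if (shares_cluster(prev, target))
                        return prev;    /* idle and cache-close: use it now */
                prev_aff = prev;        /* remember it for the fallback below */
        }

        if (recent != prev && recent != target && idle[recent]) {
                if (shares_cluster(recent, target))
                        return recent;
                recent_aff = recent;
        }

        i = scan_for_idle(target);
        if (i >= 0)
                return i;

        /*
         * The scan failed (busy system): prefer the recorded idle prev/recent
         * CPU over piling more work onto the already busy target.  Before the
         * patch this fell straight through to "return target".
         */
        if (prev_aff >= 0)
                return prev_aff;
        if (recent_aff >= 0)
                return recent_aff;

        return target;
}

int main(void)
{
        /* Busy system: only CPU 5 (the task's prev CPU) is idle; the wakeup
         * target is CPU 1 in the other cluster. */
        idle[5] = true;
        printf("wake on CPU %d\n", pick_wakeup_cpu(5, 2, 1)); /* prints 5 */
        return 0;
}

Compiled and run, the sketch wakes the task on its idle prev CPU (CPU 5)
instead of the busy target; before the patch the same situation fell through
to "return target", which is the regression described above.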