Message ID: 20171004161850.wgnu73dokpjfyfdk@hirez.programming.kicks-ass.net (mailing list archive)
State: New, archived
On Wed, 2017-10-04 at 18:18 +0200, Peter Zijlstra wrote:
> On Tue, Oct 03, 2017 at 10:39:32AM +0200, Peter Zijlstra wrote:
> > So I was waiting for Rik, who promised to run a bunch of NUMA
> > workloads over the weekend.
> >
> > The trivial thing regresses a wee bit on the overloaded case, I've
> > not yet tried to fix it.
>
> WA_IDLE is my 'old' patch and what you all tested, WA_WEIGHT is the
> addition -- based on the old scheme -- that I've tried in order to
> lift the overloaded case (including hackbench).
>
> Its not an unconditional win, but I'm tempted to default enable
> WA_WEIGHT too (I've not done NO_WA_IDLE && WA_WEIGHT runs).

Enabling both makes sense to me. We have four cases to deal with:

- mostly idle system: we don't really care, since select_idle_sibling
  will find an idle core anywhere

- partially loaded system (say 1/2 or 2/3 full): WA_IDLE will be a
  great policy to help locate an idle CPU

- fully loaded system: either policy works well

- overloaded system: WA_WEIGHT seems to do the trick, assuming load
  balancing results in largely similar loads between cores inside
  each LLC

The big danger is affine wakeups upsetting the balance the load
balancer establishes, with the two mechanisms undoing each other's
placement. However, there seems to be very little we can actually do
about that without the unacceptable overhead of examining the
instantaneous load on every CPU in an LLC -- otherwise we end up
either overshooting, or not taking advantage of idle CPUs, due to
the use of cached load values.
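To make that ordering concrete, here is a toy userspace model of the
WA_IDLE -> WA_WEIGHT decision chain. This is not the kernel code: the
struct, helpers and numbers below are made up for illustration, and it
ignores sync wakeups, CPU capacity and the imbalance_pct bias (Peter's
actual patch is further down the thread).

#include <stdbool.h>
#include <stdio.h>

struct cpu {
        bool idle;              /* is the CPU idle right now? */
        unsigned long load;     /* runnable load, arbitrary units */
};

/* WA_IDLE: pull the task if the waking CPU is idle. */
static bool wake_affine_idle(const struct cpu *waker)
{
        return waker->idle;
}

/* WA_WEIGHT: pull the task if the waking CPU is (roughly) less loaded. */
static bool wake_affine_weight(const struct cpu *waker, const struct cpu *prev)
{
        return waker->load <= prev->load;
}

static bool wake_affine(const struct cpu *waker, const struct cpu *prev)
{
        bool affine = false;

        if (!affine)                    /* WA_IDLE */
                affine = wake_affine_idle(waker);
        if (!affine)                    /* WA_WEIGHT */
                affine = wake_affine_weight(waker, prev);

        return affine;
}

int main(void)
{
        struct cpu idle_waker = { .idle = true,  .load = 0 };
        struct cpu busy_waker = { .idle = false, .load = 700 };
        struct cpu prev       = { .idle = false, .load = 500 };

        /* Partially loaded box: WA_IDLE decides, WA_WEIGHT never runs. */
        printf("idle waker: affine=%d\n", wake_affine(&idle_waker, &prev));
        /* Overloaded box: WA_IDLE declines, WA_WEIGHT compares the loads. */
        printf("busy waker: affine=%d\n", wake_affine(&busy_waker, &prev));
        return 0;
}

With both features enabled, WA_WEIGHT only ever gets to decide when
WA_IDLE has already declined, which lines up with the idle / partially
loaded / overloaded split above.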
On Wed, 04 Oct, at 06:18:50PM, Peter Zijlstra wrote:
> On Tue, Oct 03, 2017 at 10:39:32AM +0200, Peter Zijlstra wrote:
> > So I was waiting for Rik, who promised to run a bunch of NUMA workloads
> > over the weekend.
> >
> > The trivial thing regresses a wee bit on the overloaded case, I've not
> > yet tried to fix it.
>
> WA_IDLE is my 'old' patch and what you all tested, WA_WEIGHT is the
> addition -- based on the old scheme -- that I've tried in order to lift
> the overloaded case (including hackbench).

My results (2 nodes, 12 cores/node, 2 threads/core) show that you've
pretty much restored hackbench performance to v4.12. However, it's a
regression against v4.13 for hackbench-process-pipes (I'm guessing the
v4.13 improvement is due to Rik's original patches).

The last two result columns are your latest patch with NO_WA_WEIGHT and
then with WA_WEIGHT enabled. (I hope you've all got wide terminals)

hackbench-process-pipes
                     4.12.0              4.13.0              4.13.0              4.13.0                      4.13.0                   4.13.0
                    vanilla             vanilla          peterz-fix             rik-fix  peterz-fix-v2-no-wa-weight  peterz-fix-v2-wa-weight
Amean     1     1.1600 ( 0.00%)    1.6037 ( -38.25%)    1.0727 ( 7.53%)      1.0200 ( 12.07%)    1.0837 ( 6.58%)      1.1110 ( 4.22%)
Amean     4     2.4207 ( 0.00%)    2.2300 ( 7.88%)      2.0520 ( 15.23%)     1.9483 ( 19.51%)    2.0623 ( 14.80%)     2.2807 ( 5.78%)
Amean     7     5.4140 ( 0.00%)    3.2027 ( 40.84%)     3.6100 ( 33.32%)     3.5620 ( 34.21%)    3.5590 ( 34.26%)     4.6573 ( 13.98%)
Amean     12    9.7130 ( 0.00%)    4.7170 ( 51.44%)     6.5280 ( 32.79%)     6.2063 ( 36.10%)    6.5670 ( 32.39%)     10.5440 ( -8.56%)
Amean     21    11.6687 ( 0.00%)   8.8073 ( 24.52%)     14.4997 ( -24.26%)   10.2700 ( 11.99%)   14.3187 ( -22.71%)   11.5417 ( 1.09%)
Amean     30    14.6410 ( 0.00%)   11.7003 ( 20.09%)    23.7660 ( -62.32%)   13.9847 ( 4.48%)    21.8957 ( -49.55%)   14.4847 ( 1.07%)
Amean     48    19.8780 ( 0.00%)   17.0317 ( 14.32%)    37.6397 ( -89.35%)   19.7577 ( 0.61%)    39.2110 ( -97.26%)   20.3293 ( -2.27%)
Amean     79    46.4200 ( 0.00%)   27.1180 ( 41.58%)    58.4037 ( -25.82%)   35.5537 ( 23.41%)   60.8957 ( -31.18%)   49.7470 ( -7.17%)
Amean     110   57.7550 ( 0.00%)   42.7013 ( 26.06%)    73.0483 ( -26.48%)   48.8880 ( 15.35%)   77.8597 ( -34.81%)   61.9353 ( -7.24%)
Amean     141   61.0490 ( 0.00%)   48.0857 ( 21.23%)    98.5567 ( -61.44%)   63.2187 ( -3.55%)   90.4857 ( -48.22%)   68.3100 ( -11.89%)
Amean     172   70.5180 ( 0.00%)   59.3620 ( 15.82%)    122.5423 ( -73.77%)  76.0197 ( -7.80%)   127.4023 ( -80.67%)  75.8233 ( -7.52%)
Amean     192   76.1643 ( 0.00%)   65.1613 ( 14.45%)    142.1393 ( -86.62%)  91.4923 ( -20.12%)  145.0663 ( -90.46%)  80.5867 ( -5.81%)

But things look pretty good for hackbench-process-sockets:

hackbench-process-sockets
                     4.12.0              4.13.0              4.13.0              4.13.0                      4.13.0                   4.13.0
                    vanilla             vanilla          peterz-fix             rik-fix  peterz-fix-v2-no-wa-weight  peterz-fix-v2-wa-weight
Amean     1     0.9657 ( 0.00%)    1.0850 ( -12.36%)    1.3737 ( -42.25%)    1.3093 ( -35.59%)   1.3220 ( -36.90%)    1.3937 ( -44.32%)
Amean     4     2.3040 ( 0.00%)    3.3840 ( -46.88%)    2.1807 ( 5.35%)      2.3010 ( 0.13%)     2.2070 ( 4.21%)      2.1770 ( 5.51%)
Amean     7     4.5467 ( 0.00%)    4.0787 ( 10.29%)     5.0530 ( -11.14%)    3.7427 ( 17.68%)    4.5517 ( -0.11%)     3.8560 ( 15.19%)
Amean     12    5.7707 ( 0.00%)    5.4440 ( 5.66%)      10.5680 ( -83.13%)   7.7240 ( -33.85%)   10.5990 ( -83.67%)   5.9123 ( -2.45%)
Amean     21    8.9387 ( 0.00%)    9.5850 ( -7.23%)     18.3103 (-104.84%)   10.9253 ( -22.23%)  18.1540 (-103.10%)   9.2627 ( -3.62%)
Amean     30    13.1243 ( 0.00%)   14.0773 ( -7.26%)    25.6563 ( -95.49%)   15.7590 ( -20.07%)  25.6920 ( -95.76%)   14.6523 ( -11.64%)
Amean     48    25.1030 ( 0.00%)   22.5233 ( 10.28%)    40.5937 ( -61.71%)   24.6727 ( 1.71%)    40.6357 ( -61.88%)   22.1807 ( 11.64%)
Amean     79    39.9150 ( 0.00%)   33.4220 ( 16.27%)    66.3343 ( -66.19%)   40.2713 ( -0.89%)   65.8543 ( -64.99%)   35.3360 ( 11.47%)
Amean     110   49.1700 ( 0.00%)   46.1173 ( 6.21%)     92.3153 ( -87.75%)   55.6567 ( -13.19%)  92.0567 ( -87.22%)   46.7280 ( 4.97%)
Amean     141   59.3157 ( 0.00%)   57.2670 ( 3.45%)     118.5863 ( -99.92%)  70.4800 ( -18.82%)  118.6013 ( -99.95%)  57.8247 ( 2.51%)
Amean     172   69.8163 ( 0.00%)   68.2817 ( 2.20%)     145.7583 (-108.77%)  83.0167 ( -18.91%)  144.4477 (-106.90%)  68.4457 ( 1.96%)
Amean     192   75.9913 ( 0.00%)   76.0503 ( -0.08%)    159.8487 (-110.35%)  91.0133 ( -19.77%)  159.6793 (-110.13%)  76.2690 ( -0.37%)

It's a similar story for hackbench-threads-{pipes,sockets}, i.e. pipes
regress but performance is restored for sockets.

Of course, like a dope, I forgot to re-run netperf with your WA_WEIGHT
patch. So I've queued that up now and it should be done by tomorrow.
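(For anyone not fluent in mmtests output: the bracketed percentages are
relative to the first, 4.12.0 vanilla, column, and hackbench times are
lower-is-better, so positive means faster. For example, the pipes
Amean-12 figure for peterz-fix works out as
(9.7130 - 6.5280) / 9.7130 = 32.79% less time than the 4.12 baseline.)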
On Fri, 06 Oct, at 11:36:23AM, Matt Fleming wrote:
> 
> It's a similar story for hackbench-threads-{pipes,sockets}, i.e. pipes
> regress but performance is restored for sockets.
> 
> Of course, like a dope, I forgot to re-run netperf with your WA_WEIGHT
> patch. So I've queued that up now and it should be done by tomorrow.

Yeah, netperf results look fine for either your NO_WA_WEIGHT or
WA_WEIGHT patch.

Any ETA on when this is going to tip?
On Tue, Oct 10, 2017 at 03:51:37PM +0100, Matt Fleming wrote:
> On Fri, 06 Oct, at 11:36:23AM, Matt Fleming wrote:
> > 
> > It's a similar story for hackbench-threads-{pipes,sockets}, i.e. pipes
> > regress but performance is restored for sockets.
> > 
> > Of course, like a dope, I forgot to re-run netperf with your WA_WEIGHT
> > patch. So I've queued that up now and it should be done by tomorrow.
> 
> Yeah, netperf results look fine for either your NO_WA_WEIGHT or
> WA_WEIGHT patch.
> 
> Any ETA on when this is going to tip?

Just hit a few hours ago :-)
* Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Oct 10, 2017 at 03:51:37PM +0100, Matt Fleming wrote:
> > On Fri, 06 Oct, at 11:36:23AM, Matt Fleming wrote:
> > > 
> > > It's a similar story for hackbench-threads-{pipes,sockets}, i.e. pipes
> > > regress but performance is restored for sockets.
> > > 
> > > Of course, like a dope, I forgot to re-run netperf with your WA_WEIGHT
> > > patch. So I've queued that up now and it should be done by tomorrow.
> > 
> > Yeah, netperf results look fine for either your NO_WA_WEIGHT or
> > WA_WEIGHT patch.
> > 
> > Any ETA on when this is going to tip?
> 
> Just hit a few hours ago :-)

I admit that time machines are really handy!

Thanks,

	Ingo
On 10/10/2017 07:26 PM, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
>> On Tue, Oct 10, 2017 at 03:51:37PM +0100, Matt Fleming wrote:
>>> On Fri, 06 Oct, at 11:36:23AM, Matt Fleming wrote:
>>>>
>>>> It's a similar story for hackbench-threads-{pipes,sockets}, i.e. pipes
>>>> regress but performance is restored for sockets.
>>>>
>>>> Of course, like a dope, I forgot to re-run netperf with your WA_WEIGHT
>>>> patch. So I've queued that up now and it should be done by tomorrow.
>>>
>>> Yeah, netperf results look fine for either your NO_WA_WEIGHT or
>>> WA_WEIGHT patch.
>>>
>>> Any ETA on when this is going to tip?
>>
>> Just hit a few hours ago :-)
> 
> I admit that time machines are really handy!
> 
> Thanks,

Are we going to schedule this for 4.13 stable as well?
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index d7b6dab956ec..7d065abc7a47 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -71,14 +71,6 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
-
-	/*
-	 * Some variables from the most recent sd_lb_stats for this domain,
-	 * used by wake_affine().
-	 */
-	unsigned long	nr_running;
-	unsigned long	load;
-	unsigned long	capacity;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 350dbec01523..a1a6b6f52660 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5638,91 +5638,60 @@ static int wake_wide(struct task_struct *p)
 	return 1;
 }
 
-struct llc_stats {
-	unsigned long	nr_running;
-	unsigned long	load;
-	unsigned long	capacity;
-	int		has_capacity;
-};
+/*
+ * The purpose of wake_affine() is to quickly determine on which CPU we can run
+ * soonest. For the purpose of speed we only consider the waking and previous
+ * CPU.
+ *
+ * wake_affine_idle() - only considers 'now', it check if the waking CPU is (or
+ *			will be) idle.
+ *
+ * wake_affine_weight() - considers the weight to reflect the average
+ *			  scheduling latency of the CPUs. This seems to work
+ *			  for the overloaded case.
+ */
 
-static bool get_llc_stats(struct llc_stats *stats, int cpu)
+static bool
+wake_affine_idle(struct sched_domain *sd, struct task_struct *p,
+		 int this_cpu, int prev_cpu, int sync)
 {
-	struct sched_domain_shared *sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-
-	if (!sds)
-		return false;
+	if (idle_cpu(this_cpu))
+		return true;
 
-	stats->nr_running	= READ_ONCE(sds->nr_running);
-	stats->load		= READ_ONCE(sds->load);
-	stats->capacity		= READ_ONCE(sds->capacity);
-	stats->has_capacity	= stats->nr_running < per_cpu(sd_llc_size, cpu);
+	if (sync && cpu_rq(this_cpu)->nr_running == 1)
+		return true;
 
-	return true;
+	return false;
 }
 
-/*
- * Can a task be moved from prev_cpu to this_cpu without causing a load
- * imbalance that would trigger the load balancer?
- *
- * Since we're running on 'stale' values, we might in fact create an imbalance
- * but recomputing these values is expensive, as that'd mean iteration 2 cache
- * domains worth of CPUs.
- */
 static bool
-wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
-		int this_cpu, int prev_cpu, int sync)
+wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
+		   int this_cpu, int prev_cpu, int sync)
 {
-	struct llc_stats prev_stats, this_stats;
 	s64 this_eff_load, prev_eff_load;
 	unsigned long task_load;
 
-	if (!get_llc_stats(&prev_stats, prev_cpu) ||
-	    !get_llc_stats(&this_stats, this_cpu))
-		return false;
+	this_eff_load = target_load(this_cpu, sd->wake_idx);
+	prev_eff_load = source_load(prev_cpu, sd->wake_idx);
 
-	/*
-	 * If sync wakeup then subtract the (maximum possible)
-	 * effect of the currently running task from the load
-	 * of the current LLC.
-	 */
 	if (sync) {
 		unsigned long current_load = task_h_load(current);
 
-		/* in this case load hits 0 and this LLC is considered 'idle' */
-		if (current_load > this_stats.load)
+		if (current_load > this_eff_load)
 			return true;
 
-		this_stats.load -= current_load;
+		this_eff_load -= current_load;
 	}
 
-	/*
-	 * The has_capacity stuff is not SMT aware, but by trying to balance
-	 * the nr_running on both ends we try and fill the domain at equal
-	 * rates, thereby first consuming cores before siblings.
-	 */
-
-	/* if the old cache has capacity, stay there */
-	if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
-		return false;
-
-	/* if this cache has capacity, come here */
-	if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
-		return true;
-
-	/*
-	 * Check to see if we can move the load without causing too much
-	 * imbalance.
-	 */
 	task_load = task_h_load(p);
 
-	this_eff_load = 100;
-	this_eff_load *= prev_stats.capacity;
-
-	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
-	prev_eff_load *= this_stats.capacity;
+	this_eff_load += task_load;
+	this_eff_load *= 100;
+	this_eff_load *= capacity_of(prev_cpu);
 
-	this_eff_load *= this_stats.load + task_load;
-	prev_eff_load *= prev_stats.load - task_load;
+	prev_eff_load -= task_load;
+	prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
+	prev_eff_load *= capacity_of(this_cpu);
 
 	return this_eff_load <= prev_eff_load;
 }
@@ -5731,22 +5700,13 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		       int prev_cpu, int sync)
 {
 	int this_cpu = smp_processor_id();
-	bool affine;
+	bool affine = false;
 
-	/*
-	 * Default to no affine wakeups; wake_affine() should not effect a task
-	 * placement the load-balancer feels inclined to undo. The conservative
-	 * option is therefore to not move tasks when they wake up.
-	 */
-	affine = false;
+	if (sched_feat(WA_IDLE) && !affine)
+		affine = wake_affine_idle(sd, p, this_cpu, prev_cpu, sync);
 
-	/*
-	 * If the wakeup is across cache domains, try to evaluate if movement
-	 * makes sense, otherwise rely on select_idle_siblings() to do
-	 * placement inside the cache domain.
-	 */
-	if (!cpus_share_cache(prev_cpu, this_cpu))
-		affine = wake_affine_llc(sd, p, this_cpu, prev_cpu, sync);
+	if (sched_feat(WA_WEIGHT) && !affine)
+		affine = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);
 
 	schedstat_inc(p->se.statistics.nr_wakeups_affine_attempts);
 	if (affine) {
@@ -7895,7 +7855,6 @@ static inline enum fbq_type fbq_classify_rq(struct rq *rq)
  */
 static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
 {
-	struct sched_domain_shared *shared = env->sd->shared;
 	struct sched_domain *child = env->sd->child;
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
@@ -7967,22 +7926,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		if (env->dst_rq->rd->overload != overload)
 			env->dst_rq->rd->overload = overload;
 	}
-
-	if (!shared)
-		return;
-
-	/*
-	 * Since these are sums over groups they can contain some CPUs
-	 * multiple times for the NUMA domains.
-	 *
-	 * Currently only wake_affine_llc() and find_busiest_group()
-	 * uses these numbers, only the last is affected by this problem.
-	 *
-	 * XXX fix that.
-	 */
-	WRITE_ONCE(shared->nr_running,	sds->total_running);
-	WRITE_ONCE(shared->load,	sds->total_load);
-	WRITE_ONCE(shared->capacity,	sds->total_capacity);
 }
 
 /**
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index d3fb15555291..d40d33ec935f 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -81,3 +81,5 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
 SCHED_FEAT(LB_MIN, false)
 SCHED_FEAT(ATTACH_AGE_LOAD, true)
 
+SCHED_FEAT(WA_IDLE, true)
+SCHED_FEAT(WA_WEIGHT, false)
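For a concrete feel for the new wake_affine_weight() comparison, here is
a standalone userspace re-implementation of just the arithmetic. The
loads, capacities and the imbalance_pct value below are made-up example
numbers, and the sync handling and the target_load()/source_load()
indirection from the patch are deliberately left out.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Mirrors only the effective-load formula from wake_affine_weight(). */
static bool wake_affine_weight(int64_t this_load, int64_t prev_load,
                               int64_t task_load,
                               int64_t this_capacity, int64_t prev_capacity,
                               int imbalance_pct)
{
        int64_t this_eff_load = this_load;
        int64_t prev_eff_load = prev_load;

        /* Pretend the task runs on the waking CPU... */
        this_eff_load += task_load;
        this_eff_load *= 100;
        this_eff_load *= prev_capacity;

        /* ...and leaves the previous CPU, which also gets the imbalance bias. */
        prev_eff_load -= task_load;
        prev_eff_load *= 100 + (imbalance_pct - 100) / 2;
        prev_eff_load *= this_capacity;

        return this_eff_load <= prev_eff_load;
}

int main(void)
{
        /* Equal capacities; imbalance_pct 117 gives prev an 8% handicap,
         * i.e. a slight bias towards allowing the affine move. */
        printf("waker lighter: %d\n",
               wake_affine_weight(200, 900, 300, 1024, 1024, 117)); /* 1: pull */
        printf("waker heavier: %d\n",
               wake_affine_weight(600, 900, 300, 1024, 1024, 117)); /* 0: stay */
        return 0;
}

With CONFIG_SCHED_DEBUG, the new feature bits can be flipped at runtime
through /sys/kernel/debug/sched_features (writing e.g. WA_WEIGHT or
NO_WA_WEIGHT), which is presumably how the NO_WA_WEIGHT runs in the
benchmark results above were done.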