[RFCv4,31/34] sched: Energy-aware wake-up task placement

Message ID 1431459549-18343-32-git-send-email-morten.rasmussen@arm.com (mailing list archive)
State RFC

Commit Message

Morten Rasmussen May 12, 2015, 7:39 p.m. UTC
Let available compute capacity and estimated energy impact select the
wake-up target cpu when energy-aware scheduling is enabled and the
system is not over-utilized (above the tipping point).

energy_aware_wake_cpu() attempts to find a group of cpus with sufficient
compute capacity to accommodate the task, and then a cpu within that
group with enough spare capacity to handle the task. Preference is given
to cpus with enough spare capacity at the current OPP. Finally, the
energy impact of placing the task on the new target cpu and on the
previous task cpu is compared to select the wake-up target cpu.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 84 insertions(+), 1 deletion(-)
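
In outline, the new selection logic proceeds in three steps. The condensed,
standalone sketch below mirrors only that structure: the data layout and
helper names are invented for illustration, and the real implementation (in
the diff at the bottom of this page) walks the scheduling topology and
consults the energy model instead.

#include <stdio.h>
#include <limits.h>

/* Illustrative stand-in for per-cpu capacity/utilization state. */
struct cpu {
	int cap_orig;	/* max capacity at the highest OPP */
	int cap_curr;	/* capacity at the current OPP */
	int usage;	/* current utilization estimate */
	int group;	/* "cluster" (sched_group) the cpu belongs to */
};

static struct cpu cpus[] = {
	/* little cluster (group 0) and big cluster (group 1) */
	{  430, 300, 120, 0 }, {  430, 300,  40, 0 },
	{ 1024, 600, 500, 1 }, { 1024, 600, 700, 1 },
};
#define NR_CPUS ((int)(sizeof(cpus) / sizeof(cpus[0])))

static int pick_wake_cpu(int prev_cpu, int task_util)
{
	int target_max_cap = INT_MAX, target_group = -1;
	int target_cpu = prev_cpu, i;

	/*
	 * 1. Pick the group with the smallest max capacity that still
	 *    fits the task (assumed most energy-efficient).
	 */
	for (i = 0; i < NR_CPUS; i++) {
		if (cpus[i].cap_orig < target_max_cap &&
		    task_util < cpus[i].cap_orig) {
			target_group = cpus[i].group;
			target_max_cap = cpus[i].cap_orig;
		}
	}

	/*
	 * 2. Within that group, prefer a cpu with spare capacity at the
	 *    current OPP; keep one that only fits at a higher OPP as a
	 *    fallback.
	 */
	for (i = 0; i < NR_CPUS; i++) {
		int new_usage;

		if (cpus[i].group != target_group)
			continue;
		new_usage = cpus[i].usage + task_util;
		if (new_usage > cpus[i].cap_orig)
			continue;
		if (new_usage < cpus[i].cap_curr)
			return i;		/* fits without raising the OPP */
		if (target_cpu == prev_cpu)
			target_cpu = i;		/* fallback at a higher OPP */
	}

	/*
	 * 3. The real code finally compares the estimated energy of moving
	 *    vs. staying (energy_diff()) before committing; omitted here.
	 */
	return target_cpu;
}

int main(void)
{
	printf("picked cpu %d\n", pick_wake_cpu(2, 100));
	return 0;
}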

Comments

Morten Rasmussen May 14, 2015, 3:10 p.m. UTC | #1
On Thu, May 14, 2015 at 10:34:20AM +0100, pang.xunlei@zte.com.cn wrote:
> Morten Rasmussen <morten.rasmussen@arm.com> wrote 2015-05-13 AM 03:39:06:
> > [RFCv4 PATCH 31/34] sched: Energy-aware wake-up task placement
> >
> > [...]
> >
> > +
> > +   if (target_cpu != task_cpu(p)) {
> > +      struct energy_env eenv = {
> > +         .usage_delta   = task_utilization(p),
> > +         .src_cpu   = task_cpu(p),
> > +         .dst_cpu   = target_cpu,
> > +      };
> 
> At this point, p hasn't been queued on src_cpu, but energy_diff() below will
> still subtract its utilization from src_cpu, is that right?

energy_aware_wake_cpu() should only be called for existing tasks, i.e.
SD_BALANCE_WAKE, so p should have been queued on src_cpu in the past.
New tasks (SD_BALANCE_FORK) take the find_idlest_{group, cpu}() route.

Or did I miss something?

Since p was last scheduled on src_cpu, its usage should still be
accounted for in the blocked utilization of that cpu. At wake-up we are
effectively turning blocked utilization into runnable utilization. The
cpu usage (get_cpu_usage()) is the sum of the two, and this is the basis
for the energy calculations. So if we migrate the task at wake-up we
should remove the task utilization from the previous cpu and add it to
dst_cpu.
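
To make the bookkeeping concrete, here is a minimal standalone sketch of
that accounting (the struct and function names below are invented for
illustration and do not correspond to kernel internals; only the pattern
of moving the task's utilization from src to dst follows the description
above):

#include <stdio.h>

struct cpu_usage_est {
	int running_util;	/* utilization of currently runnable tasks */
	int blocked_util;	/* decaying utilization of sleeping tasks */
};

/* cpu usage is the sum of the runnable and blocked contributions */
static int cpu_usage(const struct cpu_usage_est *c)
{
	return c->running_util + c->blocked_util;
}

/*
 * Usage seen by the energy comparison if the waking task migrates:
 * remove its utilization from src (where it is still tracked as
 * blocked) and add it to dst as runnable utilization.
 */
static void account_wakeup_migration(struct cpu_usage_est *src,
				     struct cpu_usage_est *dst,
				     int task_util)
{
	src->blocked_util -= task_util;	/* ignores decay, see below */
	if (src->blocked_util < 0)
		src->blocked_util = 0;
	dst->running_util += task_util;
}

int main(void)
{
	struct cpu_usage_est prev = { .running_util = 100, .blocked_util = 200 };
	struct cpu_usage_est target = { .running_util = 50, .blocked_util = 0 };

	account_wakeup_migration(&prev, &target, 200);
	printf("prev usage: %d, target usage: %d\n",
	       cpu_usage(&prev), cpu_usage(&target));
	return 0;
}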

As Sai has raised previously, this is not the full story. The blocked
utilization contribution of p on the previous cpu may have decayed while
the task utilization stored in p->se.avg has not. It is therefore
misleading to subtract the non-decayed utilization from src_cpu's
blocked utilization. Fixing that is on the todo-list.

Does that make any sense?

Morten
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bb44646..fe41e1e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5394,6 +5394,86 @@  static int select_idle_sibling(struct task_struct *p, int target)
 	return target;
 }
 
+static int energy_aware_wake_cpu(struct task_struct *p)
+{
+	struct sched_domain *sd;
+	struct sched_group *sg, *sg_target;
+	int target_max_cap = INT_MAX;
+	int target_cpu = task_cpu(p);
+	int i;
+
+	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
+
+	if (!sd)
+		return -1;
+
+	sg = sd->groups;
+	sg_target = sg;
+
+	/*
+	 * Find group with sufficient capacity. We only get here if no cpu is
+	 * overutilized. We may end up overutilizing a cpu by adding the task,
+	 * but that should not be any worse than select_idle_sibling().
+	 * load_balance() should sort it out later as we get above the tipping
+	 * point.
+	 */
+	do {
+		/* Assuming all cpus are the same in group */
+		int max_cap_cpu = group_first_cpu(sg);
+
+		/*
+		 * Assume smaller max capacity means more energy-efficient.
+		 * Ideally we should query the energy model for the right
+		 * answer but it easily ends up in an exhaustive search.
+		 */
+		if (capacity_of(max_cap_cpu) < target_max_cap &&
+		    task_fits_capacity(p, max_cap_cpu)) {
+			sg_target = sg;
+			target_max_cap = capacity_of(max_cap_cpu);
+		}
+	} while (sg = sg->next, sg != sd->groups);
+
+	/* Find cpu with sufficient capacity */
+	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
+		/*
+		 * p's blocked utilization is still accounted for on prev_cpu,
+		 * so prev_cpu will receive a negative bias due to the double
+		 * accounting. However, the blocked utilization may be zero.
+		 */
+		int new_usage = get_cpu_usage(i) + task_utilization(p);
+
+		if (new_usage > capacity_orig_of(i))
+			continue;
+
+		if (new_usage < capacity_curr_of(i)) {
+			target_cpu = i;
+			if (cpu_rq(i)->nr_running)
+				break;
+		}
+
+		/* cpu has capacity at higher OPP, keep it as fallback */
+		if (target_cpu == task_cpu(p))
+			target_cpu = i;
+	}
+
+	if (target_cpu != task_cpu(p)) {
+		struct energy_env eenv = {
+			.usage_delta	= task_utilization(p),
+			.src_cpu	= task_cpu(p),
+			.dst_cpu	= target_cpu,
+		};
+
+		/* Not enough spare capacity on previous cpu */
+		if (cpu_overutilized(task_cpu(p)))
+			return target_cpu;
+
+		if (energy_diff(&eenv) >= 0)
+			return task_cpu(p);
+	}
+
+	return target_cpu;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5446,7 +5526,10 @@  select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		prev_cpu = cpu;
 
 	if (sd_flag & SD_BALANCE_WAKE && want_sibling) {
-		new_cpu = select_idle_sibling(p, prev_cpu);
+		if (energy_aware() && !cpu_rq(cpu)->rd->overutilized)
+			new_cpu = energy_aware_wake_cpu(p);
+		else
+			new_cpu = select_idle_sibling(p, prev_cpu);
 		goto unlock;
 	}