
[RFCv4,11/34] sched: Remove blocked load and utilization contributions of dying tasks

Message ID 1431459549-18343-12-git-send-email-morten.rasmussen@arm.com (mailing list archive)
State RFC
Headers show

Commit Message

Morten Rasmussen May 12, 2015, 7:38 p.m. UTC
Tasks being dequeued for the last time (state == TASK_DEAD) are dequeued
with the DEQUEUE_SLEEP flag, which causes their load and utilization
contributions to be added to the runqueue blocked load and utilization.
Hence these sums will contain load and utilization that has gone away
for good. The issue only exists for the root cfs_rq, as cgroup_exit()
doesn't set DEQUEUE_SLEEP for task group exits.

If runnable+blocked load is to be used as a better estimate for cpu
load, the dead task contributions need to be removed to prevent
load_balance() (idle_balance() in particular) from over-estimating the
cpu load.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Sai May 13, 2015, 12:33 a.m. UTC | #1
On 05/12/2015 12:38 PM, Morten Rasmussen wrote:
> Task being dequeued for the last time (state == TASK_DEAD) are dequeued
> with the DEQUEUE_SLEEP flag which causes their load and utilization
> contributions to be added to the runqueue blocked load and utilization.
> Hence they will contain load or utilization that is gone away. The issue
> only exists for the root cfs_rq as cgroup_exit() doesn't set
> DEQUEUE_SLEEP for task group exits.
> 
> If runnable+blocked load is to be used as a better estimate for cpu
> load the dead task contributions need to be removed to prevent
> load_balance() (idle_balance() in particular) from over-estimating the
> cpu load.
> 
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e40cd88..d045404 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3202,6 +3202,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	 * Update run-time statistics of the 'current'.
>  	 */
>  	update_curr(cfs_rq);
> +	if (entity_is_task(se) && task_of(se)->state == TASK_DEAD)
> +		flags &= !DEQUEUE_SLEEP;
>  	dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
>  
>  	update_stats_dequeue(cfs_rq, se);
> 

Maybe you could use the sched_class->task_dead() callback instead? I
remember seeing a patch from Yuyang that did that.
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Morten Rasmussen May 13, 2015, 1:49 p.m. UTC | #2
On Wed, May 13, 2015 at 01:33:22AM +0100, Sai Gurrappadi wrote:
> On 05/12/2015 12:38 PM, Morten Rasmussen wrote:
> > [...]
> 
> Maybe you could use the sched_class->task_dead() callback instead? I
> remember seeing a patch from Yuyang that did that.

Now that you mention it, I remember that thread I think. I will have a
look again and see if there are any good reasons not to use task_dead().
Morten Rasmussen May 19, 2015, 2:22 p.m. UTC | #3
On Wed, May 13, 2015 at 02:49:57PM +0100, Morten Rasmussen wrote:
> On Wed, May 13, 2015 at 01:33:22AM +0100, Sai Gurrappadi wrote:
> > On 05/12/2015 12:38 PM, Morten Rasmussen wrote:
> > > [...]
> > 
> > Maybe you could use the sched_class->task_dead() callback instead? I
> > remember seeing a patch from Yuyang that did that.
> 
> Now that you mention it, I remember that thread I think. I will have a
> look again and see if there are any good reasons not to use task_dead().

After having looked at it again, I think using task_dead() is another
option, but it isn't the most elegant one. task_dead() is called very
late in the game, after the task has been dequeued and the rq lock has
been released. Doing it there would require re-taking the rq lock to
remove the task contribution from the rq blocked load that was just
added as part of the dequeue operation.

The above patch instead ensures that the task contribution isn't added
to the rq blocked load at dequeue when we know the task is dead, rather
than having to fix it up later. The problem arises because schedule()
doesn't really care whether the previous task is going to die or just
sleep. It passes DEQUEUE_SLEEP to the sched_class in both cases, so we
have to distinguish the two cases in fair.c.

Morten

Patch

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e40cd88..d045404 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3202,6 +3202,8 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	 * Update run-time statistics of the 'current'.
 	 */
 	update_curr(cfs_rq);
+	if (entity_is_task(se) && task_of(se)->state == TASK_DEAD)
+		flags &= !DEQUEUE_SLEEP;
 	dequeue_entity_load_avg(cfs_rq, se, flags & DEQUEUE_SLEEP);
 
 	update_stats_dequeue(cfs_rq, se);