
[v4] sched/core: Preempt current task in favour of bound kthread

Message ID 20191211173829.GB21797@linux.vnet.ibm.com (mailing list archive)
State Deferred, archived

Commit Message

Srikar Dronamraju Dec. 11, 2019, 5:38 p.m. UTC
A running task can wake up a per-CPU bound kthread on the same CPU.
If the currently running task doesn't yield the CPU before the next load
balance operation, the scheduler detects a load imbalance and tries to
balance the load. However, this load balance fails because the waiting
task is CPU bound, while the running task cannot be moved by the regular
load balancer. Eventually the active load balancer kicks in and moves
the running task to a different CPU/core. Moving the task to a different
CPU/core can lead to a loss of cache affinity and hence poor performance.

This is more likely to happen if the currently running task is CPU
intensive and sched_wakeup_granularity is set to a large value. When
sched_wakeup_granularity was relatively small, it was observed that the
bound kthread would complete before the load balancer chose to move the
cache-hot task to a different CPU.

To deal with this situation, make the currently running task yield to a
per-CPU bound kthread, provided the kthread is not CPU intensive.
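
For reference, the helper the patch keys off is is_per_cpu_kthread(),
defined under CONFIG_SMP in kernel/sched/sched.h; as of v5.4 it looks
roughly like this (quoted here only to make "per-CPU bound kthread"
concrete):

static inline bool is_per_cpu_kthread(struct task_struct *p)
{
	/* a kernel thread ... */
	if (!(p->flags & PF_KTHREAD))
		return false;

	/* ... that is allowed to run on exactly one CPU */
	if (p->nr_cpus_allowed != 1)
		return false;

	return true;
}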

/pboffline/hwcct_prg_old/lib/fsperf -t overwrite --noclean -f 5g -b 4k /pboffline

(With sched_wakeup_granularity set to 15ms)

Performance counter stats for 'system wide' (5 runs):
event					    v5.4 				v5.4 + patch(v3)
probe:active_load_balance_cpu_stop       1,919  ( +-  2.89% )                     4  ( +- 20.48% )
sched:sched_waking                     441,535  ( +-  0.17% )               914,630  ( +-  0.18% )
sched:sched_wakeup                     441,533  ( +-  0.17% )               914,630  ( +-  0.18% )
sched:sched_wakeup_new                   2,436  ( +-  8.08% )                   545  ( +-  4.02% )
sched:sched_switch                     797,007  ( +-  0.26% )             1,490,261  ( +-  0.10% )
sched:sched_migrate_task                20,998  ( +-  1.04% )                 2,492  ( +- 11.56% )
sched:sched_process_free                 2,436  ( +-  7.90% )                   526  ( +-  3.65% )
sched:sched_process_exit                 2,451  ( +-  7.85% )                   546  ( +-  4.06% )
sched:sched_wait_task                        7  ( +- 21.20% )                     1  ( +- 40.82% )
sched:sched_process_wait                 3,951  ( +-  9.14% )                   854  ( +-  5.33% )
sched:sched_process_fork                 2,435  ( +-  8.09% )                   545  ( +-  3.96% )
sched:sched_process_exec                 1,023  ( +- 12.21% )                   205  ( +-  5.13% )
sched:sched_wake_idle_without_ipi      187,794  ( +-  1.14% )               353,579  ( +-  0.42% )

Elapsed time in seconds          289.43 +- 1.42 ( +-  0.49% )      72.7318 +- 0.0545 ( +-  0.07% )

Throughput results

v5.4
Trigger time:................... 0.842679 s   (Throughput:     6075.86 MB/s)
Asynchronous submit time:.......   1.0184 s   (Throughput:     5027.49 MB/s)
Synchronous submit time:........        0 s   (Throughput:           0 MB/s)
I/O time:.......................   263.17 s   (Throughput:      19.455 MB/s)
Ratio trigger time to I/O time:.0.00320202

v5.4 + patch(v3)
Trigger time:................... 0.852413 s   (Throughput:     6006.47 MB/s)
Asynchronous submit time:....... 0.773043 s   (Throughput:     6623.17 MB/s)
Synchronous submit time:........        0 s   (Throughput:           0 MB/s)
I/O time:.......................   44.341 s   (Throughput:     115.468 MB/s)
Ratio trigger time to I/O time:. 0.019224

(With sched_wakeup_granularity set to 4ms)

Performance counter stats for 'system wide' (5 runs):
event					      v5.4 				    v5.4 + patch(v3)
probe:active_load_balance_cpu_stop               6  ( +-  6.03% )                      5  ( +- 15.04% )
sched:sched_waking                         899,880  ( +-  0.38% )                912,625  ( +-  0.41% )
sched:sched_wakeup                         899,878  ( +-  0.38% )                912,624  ( +-  0.41% )
sched:sched_wakeup_new                         622  ( +- 11.95% )                    550  ( +-  3.85% )
sched:sched_switch                       1,458,214  ( +-  0.40% )              1,489,032  ( +-  0.41% )
sched:sched_migrate_task                     3,120  ( +- 10.00% )                  2,524  ( +-  5.54% )
sched:sched_process_free                       608  ( +- 12.18% )                    528  ( +-  3.89% )
sched:sched_process_exit                       623  ( +- 11.91% )                    550  ( +-  3.79% )
sched:sched_wait_task                            1  ( +- 31.18% )                      1  ( +- 66.67% )
sched:sched_process_wait                       998  ( +- 13.22% )                    867  ( +-  4.41% )
sched:sched_process_fork                       622  ( +- 11.95% )                    550  ( +-  3.88% )
sched:sched_process_exec                       242  ( +- 13.81% )                    208  ( +-  4.57% )
sched:sched_wake_idle_without_ipi          349,165  ( +-  0.35% )                352,443  ( +-  0.21% )

Elapsed time in seconds           72.8560 +- 0.0768 ( +-  0.11% )        72.5523 +- 0.0725 ( +-  0.10% )

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog:
v1 : http://lore.kernel.org/lkml/20191209165122.GA27229@linux.vnet.ibm.com
v2 : http://lore.kernel.org/lkml/20191210054330.GF27253@linux.vnet.ibm.com
v3 : http://lore.kernel.org/lkml/20191210172307.GD9139@linux.vnet.ibm.com
v1->v2: Pass the right params to try_to_wake_up(), as correctly pointed out
	by Dave Chinner
v2->v3: Suggestions from Peter Zijlstra, including using vtime over
	context switches and detecting per-CPU kthreads in try_to_wake_up()
v3->v4: Fixed a compilation failure under !CONFIG_SMP reported by the
	kbuild test robot <lkp@intel.com>, as is_per_cpu_kthread() is only
	defined for CONFIG_SMP.

 kernel/sched/core.c  | 5 +++++
 kernel/sched/fair.c  | 2 +-
 kernel/sched/sched.h | 3 ++-
 3 files changed, 8 insertions(+), 2 deletions(-)

Comments

Dave Chinner Dec. 11, 2019, 10:46 p.m. UTC | #1
On Wed, Dec 11, 2019 at 11:08:29PM +0530, Srikar Dronamraju wrote:
> A running task can wake up a per-CPU bound kthread on the same CPU.
> If the currently running task doesn't yield the CPU before the next load
> balance operation, the scheduler detects a load imbalance and tries to
> balance the load. However, this load balance fails because the waiting
> task is CPU bound, while the running task cannot be moved by the regular
> load balancer. Eventually the active load balancer kicks in and moves
> the running task to a different CPU/core. Moving the task to a different
> CPU/core can lead to a loss of cache affinity and hence poor performance.
> 
> This is more likely to happen if the currently running task is CPU
> intensive and sched_wakeup_granularity is set to a large value. When
> sched_wakeup_granularity was relatively small, it was observed that the
> bound kthread would complete before the load balancer chose to move the
> cache-hot task to a different CPU.
> 
> To deal with this situation, make the currently running task yield to a
> per-CPU bound kthread, provided the kthread is not CPU intensive.

So a question for you here: when does the workqueue worker pre-empt
the currently running task? Is it immediately? Or when a time-slice
of the currently running task runs out?

We don't want queued work immediately pre-empting the task that
queued the work - the queued work is *deferred* work that should be
run _soon_ but we want the currently running task to finish what it
is doing first if possible. i.e. these are not synchronous wakeups,
and so we shouldn't schedule kworker threads as though they are sync
wakeups. That will affect batch processing efficiency and reduce
throughput because it will greatly increase the number of
unnecessary context switches during IO completion processing....

> /pboffline/hwcct_prg_old/lib/fsperf -t overwrite --noclean -f 5g -b 4k /pboffline
> 
> (With sched_wakeup_granularity set to 15ms)
> 
> Performance counter stats for 'system wide' (5 runs):
> event					    v5.4 				v5.4 + patch(v3)
> probe:active_load_balance_cpu_stop       1,919  ( +-  2.89% )                     4  ( +- 20.48% )
> sched:sched_waking                     441,535  ( +-  0.17% )               914,630  ( +-  0.18% )
> sched:sched_wakeup                     441,533  ( +-  0.17% )               914,630  ( +-  0.18% )
> sched:sched_wakeup_new                   2,436  ( +-  8.08% )                   545  ( +-  4.02% )
> sched:sched_switch                     797,007  ( +-  0.26% )             1,490,261  ( +-  0.10% )
> sched:sched_migrate_task                20,998  ( +-  1.04% )                 2,492  ( +- 11.56% )

As we see here, we've doubled the number of context switches
(increased by 700,000) just to avoid 17,000 incorrect load balancer
task migrations.

That seems like we now make 700,000 incorrect decisions instead of
just 20,000. The difference is that the consequence of making this
many incorrect pre-emption decisions is vastly less than the
consequence of making the wrong migration decision.

It seems to me that we should be checking this is_per_cpu_kthread()
state for tasks queued on the runqueue during active load balancing,
rather than at wakeup time.  i.e. in these cases we don't migrate
the running task, we just let it run its timeslice out and the
local per-cpu kthreads then run appropriately.

AFAICT this would have the same effect of avoiding unnecessary task
migrations in this workload, but without causing a global change to
the way workqueue kworkers are scheduled that has the potential to
cause regressions in other workqueue intensive workloads....
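
A rough, untested sketch of that idea, just to make it concrete (the
helper name and the hook point are assumptions; the walk uses v5.4's
rq->cfs_tasks list, which links runnable CFS tasks via se.group_node):

/* true if every queued (non-running) CFS task on this rq is a per-CPU kthread */
static bool queued_tasks_are_per_cpu_kthreads(struct rq *busiest)
{
	struct task_struct *p;

	lockdep_assert_held(&busiest->lock);

	list_for_each_entry(p, &busiest->cfs_tasks, se.group_node) {
		if (p == busiest->curr)
			continue;
		if (!is_per_cpu_kthread(p))
			return false;
	}

	return true;
}

/*
 * e.g. called from active_load_balance_cpu_stop() with busiest_rq->lock
 * held, to skip pushing the running task away when the "imbalance" is only
 * per-CPU kthreads waiting behind it:
 *
 *	if (queued_tasks_are_per_cpu_kthreads(busiest_rq))
 *		goto out_unlock;
 */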

Cheers,

Dave.
Peter Zijlstra Dec. 12, 2019, 10:10 a.m. UTC | #2
On Thu, Dec 12, 2019 at 09:46:17AM +1100, Dave Chinner wrote:
> On Wed, Dec 11, 2019 at 11:08:29PM +0530, Srikar Dronamraju wrote:
> > A running task can wake up a per-CPU bound kthread on the same CPU.
> > If the currently running task doesn't yield the CPU before the next load
> > balance operation, the scheduler detects a load imbalance and tries to
> > balance the load. However, this load balance fails because the waiting
> > task is CPU bound, while the running task cannot be moved by the regular
> > load balancer. Eventually the active load balancer kicks in and moves
> > the running task to a different CPU/core. Moving the task to a different
> > CPU/core can lead to a loss of cache affinity and hence poor performance.
> > 
> > This is more likely to happen if the currently running task is CPU
> > intensive and sched_wakeup_granularity is set to a large value. When
> > sched_wakeup_granularity was relatively small, it was observed that the
> > bound kthread would complete before the load balancer chose to move the
> > cache-hot task to a different CPU.
> > 
> > To deal with this situation, make the currently running task yield to a
> > per-CPU bound kthread, provided the kthread is not CPU intensive.
> 
> So a question for you here: when does the workqueue worker pre-empt
> the currently running task? Is it immediately? Or when a time-slice
> of the currently running task runs out?
> 
> We don't want queued work immediately pre-empting the task that
> queued the work - the queued work is *deferred* work that should be
> run _soon_ but we want the currently running task to finish what it
> is doing first if possible. 

Good point, something to maybe try (Srikar?) is making tick preemption
more aggressive for such tasks.

The below extends the previous patch to retain the set_next_buddy() on
wakeup, but does not make the actual preemption more aggressive.

Then it 'fixes' the tick preemption to better align with the actual
scheduler pick (i.e. consider the buddy hints).

---
 kernel/sched/core.c  |  3 +++
 kernel/sched/fair.c  | 26 +++++++++++++++++++-------
 kernel/sched/sched.h |  3 ++-
 3 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 90e4b00ace89..75738b136ea7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2560,6 +2560,9 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	success = 1;
 	cpu = task_cpu(p);
 
+	if (is_per_cpu_kthread(p))
+		wake_flags |= WF_KTHREAD;
+
 	/*
 	 * Ensure we load p->on_rq _after_ p->state, otherwise it would
 	 * be possible to, falsely, observe p->on_rq == 0 and get stuck
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 08a233e97a01..78e681c8c19a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4126,6 +4126,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		update_min_vruntime(cfs_rq);
 }
 
+static struct sched_entity *
+__pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr);
+
 /*
  * Preempt the current task with a newly woken task if needed:
  */
@@ -4156,13 +4159,13 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	if (delta_exec < sysctl_sched_min_granularity)
 		return;
 
-	se = __pick_first_entity(cfs_rq);
+	se = __pick_next_entity(cfs_rq, NULL);
 	delta = curr->vruntime - se->vruntime;
 
 	if (delta < 0)
 		return;
 
-	if (delta > ideal_runtime)
+	if (delta > ideal_runtime) // maybe frob this too ?
 		resched_curr(rq_of(cfs_rq));
 }
 
@@ -4210,7 +4213,7 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
  * 4) do not run the "skip" process, if something else is available
  */
 static struct sched_entity *
-pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
+__pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 {
 	struct sched_entity *left = __pick_first_entity(cfs_rq);
 	struct sched_entity *se;
@@ -4255,8 +4258,14 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
 	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
 		se = cfs_rq->next;
 
-	clear_buddies(cfs_rq, se);
+	return se;
+}
 
+static struct sched_entity *
+pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
+{
+	struct sched_entity *se = __pick_next_entity(cfs_rq, curr);
+	clear_buddies(cfs_rq, se);
 	return se;
 }
 
@@ -6565,7 +6574,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	struct sched_entity *se = &curr->se, *pse = &p->se;
 	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
 	int scale = cfs_rq->nr_running >= sched_nr_latency;
-	int next_buddy_marked = 0;
+	int wpe, next_buddy_marked = 0;
 
 	if (unlikely(se == pse))
 		return;
@@ -6612,14 +6621,17 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	find_matching_se(&se, &pse);
 	update_curr(cfs_rq_of(se));
 	BUG_ON(!pse);
-	if (wakeup_preempt_entity(se, pse) == 1) {
+	wpe = wakeup_preempt_entity(se, pse);
+	if (wpe >= !(wake_flags & WF_KTHREAD)) {
 		/*
 		 * Bias pick_next to pick the sched entity that is
 		 * triggering this preemption.
 		 */
 		if (!next_buddy_marked)
 			set_next_buddy(pse);
-		goto preempt;
+
+		if (wpe == 1)
+			goto preempt;
 	}
 
 	return;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 280a3c735935..2ee86ef51001 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1643,7 +1643,8 @@ static inline int task_on_rq_migrating(struct task_struct *p)
  */
 #define WF_SYNC			0x01		/* Waker goes to sleep after wakeup */
 #define WF_FORK			0x02		/* Child wakeup after fork */
-#define WF_MIGRATED		0x4		/* Internal use, task got migrated */
+#define WF_MIGRATED		0x04		/* Internal use, task got migrated */
+#define WF_KTHREAD		0x08
 
 /*
  * To aid in avoiding the subversion of "niceness" due to uneven distribution
Peter Zijlstra Dec. 12, 2019, 10:14 a.m. UTC | #3
On Thu, Dec 12, 2019 at 11:10:31AM +0100, Peter Zijlstra wrote:
> @@ -4156,13 +4159,13 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
>  	if (delta_exec < sysctl_sched_min_granularity)
>  		return;
>  
> -	se = __pick_first_entity(cfs_rq);
> +	se = __pick_next_entity(cfs_rq, NULL);
>  	delta = curr->vruntime - se->vruntime;
>  
>  	if (delta < 0)
>  		return;

What I mean with the below comment is, when this isn't enough, try
something like:

	if (se == cfs_rq->next)
		ideal_runtime /= 2;

to make it yield sooner to 'next' buddies. Sadly, due to the whole
cgroup mess, we can't say what actual task is on the end of it (without
doing a full hierarchy pick, which is more expensive still).

> -	if (delta > ideal_runtime)
> +	if (delta > ideal_runtime) // maybe frob this too ?
>  		resched_curr(rq_of(cfs_rq));
>  }
Peter Zijlstra Dec. 12, 2019, 10:23 a.m. UTC | #4
On Thu, Dec 12, 2019 at 11:14:24AM +0100, Peter Zijlstra wrote:
> On Thu, Dec 12, 2019 at 11:10:31AM +0100, Peter Zijlstra wrote:
> > @@ -4156,13 +4159,13 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> >  	if (delta_exec < sysctl_sched_min_granularity)
> >  		return;
> >  
> > -	se = __pick_first_entity(cfs_rq);
> > +	se = __pick_next_entity(cfs_rq, NULL);
> >  	delta = curr->vruntime - se->vruntime;
> >  
> >  	if (delta < 0)
> >  		return;
> 
> What I mean with the below comment is, when this isn't enough, try
> something like:
> 
> 	if (se == cfs_rq->next)
> 		ideal_runtime /= 2;
> 
> to make it yield sooner to 'next' buddies. Sadly, due to the whole
> cgroup mess, we can't say what actual task is on the end of it (without
> doing a full hierarchy pick, which is more expensive still).

Just for giggles, that'd look something like:

	while (!entity_is_task(se)) {
		cfs_rq = group_cfs_rq(se);
		se = pick_next_entity(cfs_rq, cfs_rq->curr);
	}
	p = task_of(se);

	if (is_per_cpu_kthread(p))
		ideal_runtime /= 2;

the core-scheduling patch set includes the right primitive for this I
think, pick_task_fair().

> > -	if (delta > ideal_runtime)
> > +	if (delta > ideal_runtime) // maybe frob this too ?
> >  		resched_curr(rq_of(cfs_rq));
> >  }
Vincent Guittot Dec. 12, 2019, 11:20 a.m. UTC | #5
On Thu, 12 Dec 2019 at 11:23, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Dec 12, 2019 at 11:14:24AM +0100, Peter Zijlstra wrote:
> > On Thu, Dec 12, 2019 at 11:10:31AM +0100, Peter Zijlstra wrote:
> > > @@ -4156,13 +4159,13 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > >     if (delta_exec < sysctl_sched_min_granularity)
> > >             return;
> > >
> > > -   se = __pick_first_entity(cfs_rq);
> > > +   se = __pick_next_entity(cfs_rq, NULL);
> > >     delta = curr->vruntime - se->vruntime;
> > >
> > >     if (delta < 0)
> > >             return;
> >
> > What I mean with the below comment is, when this isn't enough, try
> > something like:
> >
> >       if (se == cfs_rq->next)
> >               ideal_runtime /= 2;
> >
> > to make it yield sooner to 'next' buddies. Sadly, due to the whole
> > cgroup mess, we can't say what actual task is on the end of it (without
> > doing a full hierarchy pick, which is more expensive still).
>
> Just for giggles, that'd look something like:
>
>         while (!entity_is_task(se)) {
>                 cfs_rq = group_cfs_rq(se);
>                 se = pick_next_entity(cfs_rq, cfs_rq->curr);
>         }
>         p = task_of(se);
>
>         if (is_per_cpu_kthread(p))
>                 ideal_runtime /= 2;
>
> the core-scheduling patch set includes the right primitive for this I
> think, pick_task_fair().

why not only update wakeup_gran(), which is the only function that uses
sysctl_sched_wakeup_granularity?

For per-CPU kthreads, we could set the gran to sched_min_granularity
instead of scaling it with the thread's priority, so a per-CPU kthread can
still preempt the current task even if sysctl_sched_wakeup_granularity is
large.
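
A rough sketch of that (untested; wakeup_gran() shown as in v5.4, the
per-CPU kthread branch is the assumed change, and it only helps when the
se being granted the gran is actually a task):

static unsigned long wakeup_gran(struct sched_entity *se)
{
	unsigned long gran = sysctl_sched_wakeup_granularity;

	/* assumed change: per-CPU kthreads get the minimum, unscaled gran */
	if (entity_is_task(se) && is_per_cpu_kthread(task_of(se)))
		return sysctl_sched_min_granularity;

	return calc_delta_fair(gran, se);
}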

>
> > > -   if (delta > ideal_runtime)
> > > +   if (delta > ideal_runtime) // maybe frob this too ?
> > >             resched_curr(rq_of(cfs_rq));
> > >  }
Peter Zijlstra Dec. 12, 2019, 1:12 p.m. UTC | #6
On Thu, Dec 12, 2019 at 12:20:01PM +0100, Vincent Guittot wrote:
> On Thu, 12 Dec 2019 at 11:23, Peter Zijlstra <peterz@infradead.org> wrote:

> > Just for giggles, that'd look something like:
> >
> >         while (!entity_is_task(se)) {
> >                 cfs_rq = group_cfs_rq(se);
> >                 se = pick_next_entity(cfs_rq, cfs_rq->curr);
> >         }
> >         p = task_of(se);
> >
> >         if (is_per_cpu_kthread(p))
> >                 ideal_runtime /= 2;
> >
> > the core-scheduling patch set includes the right primitive for this I
> > think, pick_task_fair().
> 
> why not only update wakeup_gran(), which is the only function that uses
> sysctl_sched_wakeup_granularity?

I don't see how, it works on se, which need not be a task.

> For per-CPU kthreads, we could set the gran to sched_min_granularity
> instead of scaling it with the thread's priority, so a per-CPU kthread can
> still preempt the current task even if sysctl_sched_wakeup_granularity is
> large.

Also, we're not poking at wakeup preemption anymore. That, as explained
by Dave, is not the right thing to do. What we want instead is to make
tick preemption a little more aggressive.

And tick-based preemption currently looks purely at the leftmost entity
of each runqueue we find while iterating current, i.e. it might never
even see the task we tagged as next.

Also, did I say I hates cgroups :-)
Srikar Dronamraju Dec. 12, 2019, 3:07 p.m. UTC | #7
* Peter Zijlstra <peterz@infradead.org> [2019-12-12 11:10:31]:

> 
> +static struct sched_entity *
> +__pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr);

I think we already have __pick_next_entity in kernel/sched/fair.c

static struct sched_entity *__pick_next_entity(struct sched_entity *se)
{
        struct rb_node *next = rb_next(&se->run_node);

        if (!next)
                return NULL;

        return rb_entry(next, struct sched_entity, run_node);
}

I checked in v5.5-rc1, v5.4 and tip/master too. Let me know if you were
referring to a different version of code.

So I modified the only place it's called to use the newer
__pick_next_entity(cfs_rq, curr); 

But wanted to verify if that's what you had in mind.
Peter Zijlstra Dec. 12, 2019, 3:15 p.m. UTC | #8
On Thu, Dec 12, 2019 at 08:37:37PM +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2019-12-12 11:10:31]:
> 
> > 
> > +static struct sched_entity *
> > +__pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr);
> 
> I think we already have __pick_next_entity in kernel/sched/fair.c

D'oh... yeah, I just wrote stuff, it never actually got near a compiler.
Srikar Dronamraju Dec. 13, 2019, 5:32 a.m. UTC | #9
* Peter Zijlstra <peterz@infradead.org> [2019-12-12 11:10:31]:

> On Thu, Dec 12, 2019 at 09:46:17AM +1100, Dave Chinner wrote:
> > On Wed, Dec 11, 2019 at 11:08:29PM +0530, Srikar Dronamraju wrote:
> 
> Good point, something to maybe try (Srikar?) is making tick preemption
> more agressive for such tasks.
> 
> The below extends the previous patch to retain the set_next_buddy() on
> wakeup, but does not make the actual preemption more agressive.
> 
> Then it 'fixes' the tick preemption to better align with the actual
> scheduler pick (ie. consider the buddy hints).
> 

Just to let you know, I tried the patch, but it doesn't help.
The results were identical to those without the patch.

I think it's probably because when we allow the task to stay on the runqueue,
it still leads to a load_balance and so we see the active-balance kick
in.

Peter, based on what Dave is asking for, would you be okay if we add
(rough sketch below):

1. a delayed_wake_list per runqueue,
2. a new wake_up API to add tasks to this delayed_wake_list,
3. on schedule, actually waking up the tasks on the delayed_wake_list?
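
A very rough sketch of 1-3 (nothing below is in the tree or in this
series; the names, the list node in task_struct and the hook in
schedule() are invented for illustration, and locking is elided):

/* (1) a per-runqueue list of tasks whose wakeup has been deferred */
struct rq {
	/* ... existing fields ... */
	struct list_head	delayed_wake_list;
};

/* (2) defer the wakeup of a per-CPU bound kthread instead of waking it now */
void wake_up_process_delayed(struct task_struct *p)
{
	struct rq *rq = cpu_rq(task_cpu(p));

	/* delayed_wake_node would be a new list_head added to task_struct */
	list_add_tail(&p->delayed_wake_node, &rq->delayed_wake_list);
}

/* (3) called from schedule(), before pick_next_task(), to issue the wakeups */
static void flush_delayed_wakeups(struct rq *rq)
{
	struct task_struct *p, *tmp;

	list_for_each_entry_safe(p, tmp, &rq->delayed_wake_list,
				 delayed_wake_node) {
		list_del_init(&p->delayed_wake_node);
		wake_up_process(p);
	}
}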

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44123b4d14e8..2636002e4a0d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2542,6 +2542,11 @@  try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 		goto out;
 	}
 
+#ifdef CONFIG_SMP
+	if (is_per_cpu_kthread(p))
+		wake_flags |= WF_KTHREAD;
+#endif
+
 	/*
 	 * If we are going to wake up a thread waiting for CONDITION we
 	 * need to ensure that CONDITION=1 done by the caller can not be
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 69a81a5709ff..8fe40f83804d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6716,7 +6716,7 @@  static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	find_matching_se(&se, &pse);
 	update_curr(cfs_rq_of(se));
 	BUG_ON(!pse);
-	if (wakeup_preempt_entity(se, pse) == 1) {
+	if (wakeup_preempt_entity(se, pse) >= !(wake_flags & WF_KTHREAD)) {
 		/*
 		 * Bias pick_next to pick the sched entity that is
 		 * triggering this preemption.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c8870c5bd7df..fcd1ed5af9a3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1643,7 +1643,8 @@  static inline int task_on_rq_migrating(struct task_struct *p)
  */
 #define WF_SYNC			0x01		/* Waker goes to sleep after wakeup */
 #define WF_FORK			0x02		/* Child wakeup after fork */
-#define WF_MIGRATED		0x4		/* Internal use, task got migrated */
+#define WF_MIGRATED		0x04		/* Internal use, task got migrated */
+#define WF_KTHREAD		0x08		/* Per CPU Kthread */
 
 /*
  * To aid in avoiding the subversion of "niceness" due to uneven distribution
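
For readers trying to follow the one-line change to check_preempt_wakeup()
above: wakeup_preempt_entity() returns -1, 0 or 1, and the patch lowers the
preemption threshold from "1" to "0 or 1" when WF_KTHREAD is set. A small
stand-alone C model of just that decision (illustration only; in the kernel
the granularity is additionally scaled by the wakee's load weight):

#include <stdio.h>

/* vdiff = curr->vruntime - wakee->vruntime */
static int wakeup_preempt_entity(long long vdiff, long long gran)
{
	if (vdiff <= 0)
		return -1;	/* curr is not ahead of the wakee: don't preempt */
	if (vdiff > gran)
		return 1;	/* wakee lags by more than the granularity: preempt */
	return 0;		/* wakee lags, but within the granularity */
}

int main(void)
{
	long long gran  = 15000000LL;	/* 15 ms wakeup granularity, as in the test */
	long long vdiff =  4000000LL;	/* wakee is 4 ms behind the running task */
	int wf_kthread  = 1;		/* the wakee is a per-CPU bound kthread */

	/* mainline condition: preempt only when the return value is 1 */
	printf("plain wakeup preempts:   %d\n",
	       wakeup_preempt_entity(vdiff, gran) == 1);

	/* patched condition: with WF_KTHREAD the threshold drops to >= 0 */
	printf("kthread wakeup preempts: %d\n",
	       wakeup_preempt_entity(vdiff, gran) >= !wf_kthread);

	return 0;
}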