diff mbox series

[sched_ext/for-6.11] sched, sched_ext: Replace scx_next_task_picked() with sched_class->switch_class()

Message ID ZnSp5mVp3uhYganb@slm.duckdns.org (mailing list archive)
State Superseded
Headers show
Series [sched_ext/for-6.11] sched, sched_ext: Replace scx_next_task_picked() with sched_class->switch_class() | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Tejun Heo June 20, 2024, 10:15 p.m. UTC
scx_next_task_picked() is used by sched_ext to notify the BPF scheduler when
a CPU is taken away by a task dispatched from a higher priority sched_class
so that the BPF scheduler can, e.g., punt the task[s] which was running or
were waiting for the CPU to other CPUs.

Replace the sched_ext specific hook scx_next_task_picked() with a new
sched_class operation switch_class().

The changes are straightforward and the code looks better afterwards.
However, when !CONFIG_SCHED_CLASS_EXT, this just ends up adding an unused
hook which is unlikely to be useful to other sched_classes. We can #ifdef
the op with CONFIG_SCHED_CLASS_EXT but then I'm not sure the code
necessarily looks better afterwards.

Please let me know the preference. If adding #ifdef's is preferable, that's
okay too.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
---
 kernel/sched/core.c  |    5 ++++-
 kernel/sched/ext.c   |   20 ++++++++++----------
 kernel/sched/ext.h   |    4 ----
 kernel/sched/sched.h |    2 ++
 4 files changed, 16 insertions(+), 15 deletions(-)

Comments

Linus Torvalds June 20, 2024, 10:42 p.m. UTC | #1
On Thu, 20 Jun 2024 at 15:15, Tejun Heo <tj@kernel.org> wrote:
>
> The changes are straightforward and the code looks better afterwards.
> However, when !CONFIG_SCHED_CLASS_EXT, this just ends up adding an unused
> hook which is unlikely to be useful to other sched_classes. We can #ifdef
> the op with CONFIG_SCHED_CLASS_EXT but then I'm not sure the code
> necessarily looks better afterwards.

So honestly, if people _really_ care about performance here, then I
think that in the long run the right thing to do is

 - expose all the DEFINE_SCHED_CLASS() definitions in a header file

 - rename for_each_class() to FOR_EACH_CLASS() and make it unroll the
whole damn loop statically

which would turn the indirect branches into actual direct branches,
and would statically just remove any "if (!class->zyz)" conditionals.

Pretty? No. But it probably wouldn't be hugely ugly either, and
honestly, looking at the existing for_each_class() uses (and the one
single "for_class_range()" one), they are so small and the number of
classes is so small that unrolling the loop entirely doesn't sound
bad.

It wouldn't help deal with *this* case (since it's a "call variable
class"), but considering that the current __pick_next_task()

 (a) special-cases one class as-is

 (b) does a "for_each_class()" and calls an indirect call for each
when that doesn't trigger

I would claim that people don't care enough about this that one test
for a NULL 'switch_class' function would be worth worrying about.

Btw, indirect calls are now expensive enough that when you have only a
handful of choices, instead of a variable

        class->some_callback(some_arguments);

you might literally be better off with a macro that does

       #define call_sched_fn(class, name, arg...) switch (class) { \
        case &fair_name_class: fair_name_class.name(arg); break; \
        ... unroll them all here..

which then just generates a (very small) tree of if-statements.

Again, this is entirely too ugly to do unless people *really* care.
But for situations where you have a small handful of cases known at
compile-time, it's not out of the question, and it probably does
generate better code.

NOTE NOTE NOTE! This is a comp[letely independent aside, and has
nothing to do with sched_ext except for the very obvious indirect fact
that sched_ext would be one of the classes in this kind of code.

And yes, I suspect it is too ugly to actually do this.

            Linus
Tejun Heo June 21, 2024, 7:46 p.m. UTC | #2
Hello,

On Thu, Jun 20, 2024 at 03:42:48PM -0700, Linus Torvalds wrote:
...
> Btw, indirect calls are now expensive enough that when you have only a
> handful of choices, instead of a variable
> 
>         class->some_callback(some_arguments);
> 
> you might literally be better off with a macro that does
> 
>        #define call_sched_fn(class, name, arg...) switch (class) { \
>         case &fair_name_class: fair_name_class.name(arg); break; \
>         ... unroll them all here..
> 
> which then just generates a (very small) tree of if-statements.
> 
> Again, this is entirely too ugly to do unless people *really* care.
> But for situations where you have a small handful of cases known at
> compile-time, it's not out of the question, and it probably does
> generate better code.

I'll update the patch description to point to the previous message just in
case and apply it to sched_ext/for-6.11.

Thanks.
Peter Zijlstra June 24, 2024, 8:59 a.m. UTC | #3
On Thu, Jun 20, 2024 at 12:15:02PM -1000, Tejun Heo wrote:
> scx_next_task_picked() is used by sched_ext to notify the BPF scheduler when
> a CPU is taken away by a task dispatched from a higher priority sched_class
> so that the BPF scheduler can, e.g., punt the task[s] which was running or
> were waiting for the CPU to other CPUs.
> 
> Replace the sched_ext specific hook scx_next_task_picked() with a new
> sched_class operation switch_class().
> 
> The changes are straightforward and the code looks better afterwards.
> However, when !CONFIG_SCHED_CLASS_EXT, this just ends up adding an unused
> hook which is unlikely to be useful to other sched_classes. We can #ifdef
> the op with CONFIG_SCHED_CLASS_EXT but then I'm not sure the code
> necessarily looks better afterwards.
> 
> Please let me know the preference. If adding #ifdef's is preferable, that's
> okay too.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> ---
>  kernel/sched/core.c  |    5 ++++-
>  kernel/sched/ext.c   |   20 ++++++++++----------
>  kernel/sched/ext.h   |    4 ----
>  kernel/sched/sched.h |    2 ++
>  4 files changed, 16 insertions(+), 15 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5907,7 +5907,10 @@ restart:
>  	for_each_active_class(class) {
>  		p = class->pick_next_task(rq);
>  		if (p) {
> -			scx_next_task_picked(rq, p, class);
> +			const struct sched_class *prev_class = prev->sched_class;
> +
> +			if (class != prev_class && prev_class->switch_class)
> +				prev_class->switch_class(rq, p);
>  			return p;
>  		}
>  	}

I would much rather see sched_class::pick_next_task() get an extra
argument so that the BPF thing can do what it needs in there and we can
avoid this extra code here.
Peter Zijlstra June 24, 2024, 9:02 a.m. UTC | #4
On Thu, Jun 20, 2024 at 03:42:48PM -0700, Linus Torvalds wrote:
> Btw, indirect calls are now expensive enough that when you have only a
> handful of choices, instead of a variable
> 
>         class->some_callback(some_arguments);
> 
> you might literally be better off with a macro that does
> 
>        #define call_sched_fn(class, name, arg...) switch (class) { \
>         case &fair_name_class: fair_name_class.name(arg); break; \
>         ... unroll them all here..
> 
> which then just generates a (very small) tree of if-statements.
> 
> Again, this is entirely too ugly to do unless people *really* care.
> But for situations where you have a small handful of cases known at
> compile-time, it's not out of the question, and it probably does
> generate better code.
> 
> NOTE NOTE NOTE! This is a comp[letely independent aside, and has
> nothing to do with sched_ext except for the very obvious indirect fact
> that sched_ext would be one of the classes in this kind of code.
> 
> And yes, I suspect it is too ugly to actually do this.

Very early on in the retpoline mess I briefly considered doing this, but
I decided against doing the ugly until someone came with numbers bad
enough to warrant them.

We're now many years later and I'm very glad we never really *had* to go
down that route.
Peter Zijlstra June 24, 2024, 9:04 a.m. UTC | #5
On Fri, Jun 21, 2024 at 09:46:56AM -1000, Tejun Heo wrote:
> Hello,
> 
> On Thu, Jun 20, 2024 at 03:42:48PM -0700, Linus Torvalds wrote:
> ...
> > Btw, indirect calls are now expensive enough that when you have only a
> > handful of choices, instead of a variable
> > 
> >         class->some_callback(some_arguments);
> > 
> > you might literally be better off with a macro that does
> > 
> >        #define call_sched_fn(class, name, arg...) switch (class) { \
> >         case &fair_name_class: fair_name_class.name(arg); break; \
> >         ... unroll them all here..
> > 
> > which then just generates a (very small) tree of if-statements.
> > 
> > Again, this is entirely too ugly to do unless people *really* care.
> > But for situations where you have a small handful of cases known at
> > compile-time, it's not out of the question, and it probably does
> > generate better code.
> 
> I'll update the patch description to point to the previous message just in
> case and apply it to sched_ext/for-6.11.

Can you please back merge and keep it a sane series? I'm going to have
to review it (even though I still very strongly disagree with the whole
thing) and there really is nothing worse than a series that introduces
things only to remove/change them again later.
Tejun Heo June 24, 2024, 6:41 p.m. UTC | #6
Hello, Peter.

On Mon, Jun 24, 2024 at 11:04:31AM +0200, Peter Zijlstra wrote:
> > I'll update the patch description to point to the previous message just in
> > case and apply it to sched_ext/for-6.11.
> 
> Can you please back merge and keep it a sane series? I'm going to have
> to review it (even though I still very strongly disagree with the whole
> thing) and there really is nothing worse than a series that introduces
> things only to remove/change them again later.

I started the sched_ext/for-6.11 branch last week with the v7 patchset:

  http://lkml.kernel.org/r/20240618212056.2833381-1-tj@kernel.org

and would much prefer to just run it as a normal branch. No matter what form
the patchset gets applied in, we'll have constant flow of changes, fixes,
reverts and improvements no matter what, so I'd much prefer to avoid another
round of reshuffling and reposting.

Thanks.
Tejun Heo June 24, 2024, 9:01 p.m. UTC | #7
Hello, Peter.

On Mon, Jun 24, 2024 at 10:59:27AM +0200, Peter Zijlstra wrote:
> > @@ -5907,7 +5907,10 @@ restart:
> >  	for_each_active_class(class) {
> >  		p = class->pick_next_task(rq);
> >  		if (p) {
> > -			scx_next_task_picked(rq, p, class);
> > +			const struct sched_class *prev_class = prev->sched_class;
> > +
> > +			if (class != prev_class && prev_class->switch_class)
> > +				prev_class->switch_class(rq, p);
> 
> I would much rather see sched_class::pick_next_task() get an extra
> argument so that the BPF thing can do what it needs in there and we can
> avoid this extra code here.

Hmm... but here, the previous class's ->pick_next_task() might not be called
at all, so I'm not sure how that'd work. For context, sched_ext is using
this to tell the BPF scheduler that it lost a CPU to a higher priority class
(be that RT or CFS) os that the BPF scheduler can respond if necessary (e.g.
punting tasks that were queued on that CPU somewhere else and so on).

Imagine a case where a sched_ext task was running but then a RT task wakes
up on the CPU. We'd enter the scheduling path, RT's pick_next_task() would
return the new RT task to run. We now need to tell the BPF scheduler that we
lost the CPU to the RT task but haven't called its pick_next_task() yet.

Thanks.
Peter Zijlstra June 25, 2024, 7:49 a.m. UTC | #8
On Mon, Jun 24, 2024 at 11:01:10AM -1000, Tejun Heo wrote:
> Hello, Peter.
> 
> On Mon, Jun 24, 2024 at 10:59:27AM +0200, Peter Zijlstra wrote:
> > > @@ -5907,7 +5907,10 @@ restart:
> > >  	for_each_active_class(class) {
> > >  		p = class->pick_next_task(rq);
> > >  		if (p) {
> > > -			scx_next_task_picked(rq, p, class);
> > > +			const struct sched_class *prev_class = prev->sched_class;
> > > +
> > > +			if (class != prev_class && prev_class->switch_class)
> > > +				prev_class->switch_class(rq, p);
> > 
> > I would much rather see sched_class::pick_next_task() get an extra
> > argument so that the BPF thing can do what it needs in there and we can
> > avoid this extra code here.
> 
> Hmm... but here, the previous class's ->pick_next_task() might not be called
> at all, so I'm not sure how that'd work. For context, sched_ext is using
> this to tell the BPF scheduler that it lost a CPU to a higher priority class
> (be that RT or CFS) os that the BPF scheduler can respond if necessary (e.g.
> punting tasks that were queued on that CPU somewhere else and so on).
> 
> Imagine a case where a sched_ext task was running but then a RT task wakes
> up on the CPU. We'd enter the scheduling path, RT's pick_next_task() would
> return the new RT task to run. We now need to tell the BPF scheduler that we
> lost the CPU to the RT task but haven't called its pick_next_task() yet.

Bah, I got it backwards indeed. But in this case, don't you also need
something in pick_task() -- the whole core scheduling thing does much
the same.
Tejun Heo June 25, 2024, 11:30 p.m. UTC | #9
Hello,

On Tue, Jun 25, 2024 at 09:49:35AM +0200, Peter Zijlstra wrote:
> > Imagine a case where a sched_ext task was running but then a RT task wakes
> > up on the CPU. We'd enter the scheduling path, RT's pick_next_task() would
> > return the new RT task to run. We now need to tell the BPF scheduler that we
> > lost the CPU to the RT task but haven't called its pick_next_task() yet.
> 
> Bah, I got it backwards indeed. But in this case, don't you also need
> something in pick_task() -- the whole core scheduling thing does much
> the same.

Yes, indeed we do, but because we're dispatching from the balance path, the
cpu_acquire method is being called from there. Because who was running on
the CPU before us is less interesting, @prev is not passed into
cpu_acquire() but if that becomes necessary, it's already available there
too.

Thanks.
Peter Zijlstra June 26, 2024, 8:28 a.m. UTC | #10
On Tue, Jun 25, 2024 at 01:30:05PM -1000, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jun 25, 2024 at 09:49:35AM +0200, Peter Zijlstra wrote:
> > > Imagine a case where a sched_ext task was running but then a RT task wakes
> > > up on the CPU. We'd enter the scheduling path, RT's pick_next_task() would
> > > return the new RT task to run. We now need to tell the BPF scheduler that we
> > > lost the CPU to the RT task but haven't called its pick_next_task() yet.
> > 
> > Bah, I got it backwards indeed. But in this case, don't you also need
> > something in pick_task() -- the whole core scheduling thing does much
> > the same.
> 
> Yes, indeed we do, but because we're dispatching from the balance path, the
> cpu_acquire method is being called from there. Because who was running on
> the CPU before us is less interesting, @prev is not passed into
> cpu_acquire() but if that becomes necessary, it's already available there
> too.

I suppose I need to read more, because I'm not knowing what cpu_acquire
is :/ I do know I don't much like the asymmetry here, but maybe it makes
sense, dunno.
Tejun Heo June 26, 2024, 5:56 p.m. UTC | #11
Hello,

On Wed, Jun 26, 2024 at 10:28:48AM +0200, Peter Zijlstra wrote:
> I suppose I need to read more, because I'm not knowing what cpu_acquire
> is :/ I do know I don't much like the asymmetry here, but maybe it makes
> sense, dunno.

The sched_ext ops are symmetric - ops.cpu_release() is called when SCX loses
CPU to a higher priority sched class and ops.cpu_acquire() when the CPU
returns to SCX afterwards. Where they hook into is not symmetric. The class
which picks the next task already knows the previous task, so there's no
need to add anything. However, without the new sched_class->switch_class(),
the previous class has no way of knowing, so they're a bit different.

Thanks.
diff mbox series

Patch

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5907,7 +5907,10 @@  restart:
 	for_each_active_class(class) {
 		p = class->pick_next_task(rq);
 		if (p) {
-			scx_next_task_picked(rq, p, class);
+			const struct sched_class *prev_class = prev->sched_class;
+
+			if (class != prev_class && prev_class->switch_class)
+				prev_class->switch_class(rq, p);
 			return p;
 		}
 	}
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2749,10 +2749,9 @@  preempt_reason_from_class(const struct s
 	return SCX_CPU_PREEMPT_UNKNOWN;
 }
 
-void scx_next_task_picked(struct rq *rq, struct task_struct *p,
-			  const struct sched_class *active)
+static void switch_class_scx(struct rq *rq, struct task_struct *next)
 {
-	lockdep_assert_rq_held(rq);
+	const struct sched_class *next_class = next->sched_class;
 
 	if (!scx_enabled())
 		return;
@@ -2769,12 +2768,11 @@  void scx_next_task_picked(struct rq *rq,
 
 	/*
 	 * The callback is conceptually meant to convey that the CPU is no
-	 * longer under the control of SCX. Therefore, don't invoke the
-	 * callback if the CPU is is staying on SCX, or going idle (in which
-	 * case the SCX scheduler has actively decided not to schedule any
-	 * tasks on the CPU).
+	 * longer under the control of SCX. Therefore, don't invoke the callback
+	 * if the next class is below SCX (in which case the BPF scheduler has
+	 * actively decided not to schedule any tasks on the CPU).
 	 */
-	if (likely(active >= &ext_sched_class))
+	if (sched_class_above(&ext_sched_class, next_class))
 		return;
 
 	/*
@@ -2789,8 +2787,8 @@  void scx_next_task_picked(struct rq *rq,
 	if (!rq->scx.cpu_released) {
 		if (SCX_HAS_OP(cpu_release)) {
 			struct scx_cpu_release_args args = {
-				.reason = preempt_reason_from_class(active),
-				.task = p,
+				.reason = preempt_reason_from_class(next_class),
+				.task = next,
 			};
 
 			SCX_CALL_OP(SCX_KF_CPU_RELEASE,
@@ -3496,6 +3494,8 @@  DEFINE_SCHED_CLASS(ext) = {
 	.put_prev_task		= put_prev_task_scx,
 	.set_next_task		= set_next_task_scx,
 
+	.switch_class		= switch_class_scx,
+
 #ifdef CONFIG_SMP
 	.balance		= balance_scx,
 	.select_task_rq		= select_task_rq_scx,
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -33,8 +33,6 @@  static inline bool task_on_scx(const str
 	return scx_enabled() && p->sched_class == &ext_sched_class;
 }
 
-void scx_next_task_picked(struct rq *rq, struct task_struct *p,
-			  const struct sched_class *active);
 void scx_tick(struct rq *rq);
 void init_scx_entity(struct sched_ext_entity *scx);
 void scx_pre_fork(struct task_struct *p);
@@ -82,8 +80,6 @@  bool scx_prio_less(const struct task_str
 #define scx_enabled()		false
 #define scx_switched_all()	false
 
-static inline void scx_next_task_picked(struct rq *rq, struct task_struct *p,
-					const struct sched_class *active) {}
 static inline void scx_tick(struct rq *rq) {}
 static inline void scx_pre_fork(struct task_struct *p) {}
 static inline int scx_fork(struct task_struct *p) { return 0; }
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2355,6 +2355,8 @@  struct sched_class {
 	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
 	void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);
 
+	void (*switch_class)(struct rq *rq, struct task_struct *next);
+
 #ifdef CONFIG_SMP
 	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);