Message ID | ZnSp5mVp3uhYganb@slm.duckdns.org (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [sched_ext/for-6.11] sched, sched_ext: Replace scx_next_task_picked() with sched_class->switch_class() | expand |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch |
On Thu, 20 Jun 2024 at 15:15, Tejun Heo <tj@kernel.org> wrote: > > The changes are straightforward and the code looks better afterwards. > However, when !CONFIG_SCHED_CLASS_EXT, this just ends up adding an unused > hook which is unlikely to be useful to other sched_classes. We can #ifdef > the op with CONFIG_SCHED_CLASS_EXT but then I'm not sure the code > necessarily looks better afterwards. So honestly, if people _really_ care about performance here, then I think that in the long run the right thing to do is - expose all the DEFINE_SCHED_CLASS() definitions in a header file - rename for_each_class() to FOR_EACH_CLASS() and make it unroll the whole damn loop statically which would turn the indirect branches into actual direct branches, and would statically just remove any "if (!class->zyz)" conditionals. Pretty? No. But it probably wouldn't be hugely ugly either, and honestly, looking at the existing for_each_class() uses (and the one single "for_class_range()" one), they are so small and the number of classes is so small that unrolling the loop entirely doesn't sound bad. It wouldn't help deal with *this* case (since it's a "call variable class"), but considering that the current __pick_next_task() (a) special-cases one class as-is (b) does a "for_each_class()" and calls an indirect call for each when that doesn't trigger I would claim that people don't care enough about this that one test for a NULL 'switch_class' function would be worth worrying about. Btw, indirect calls are now expensive enough that when you have only a handful of choices, instead of a variable class->some_callback(some_arguments); you might literally be better off with a macro that does #define call_sched_fn(class, name, arg...) switch (class) { \ case &fair_name_class: fair_name_class.name(arg); break; \ ... unroll them all here.. which then just generates a (very small) tree of if-statements. Again, this is entirely too ugly to do unless people *really* care. But for situations where you have a small handful of cases known at compile-time, it's not out of the question, and it probably does generate better code. NOTE NOTE NOTE! This is a comp[letely independent aside, and has nothing to do with sched_ext except for the very obvious indirect fact that sched_ext would be one of the classes in this kind of code. And yes, I suspect it is too ugly to actually do this. Linus
Hello, On Thu, Jun 20, 2024 at 03:42:48PM -0700, Linus Torvalds wrote: ... > Btw, indirect calls are now expensive enough that when you have only a > handful of choices, instead of a variable > > class->some_callback(some_arguments); > > you might literally be better off with a macro that does > > #define call_sched_fn(class, name, arg...) switch (class) { \ > case &fair_name_class: fair_name_class.name(arg); break; \ > ... unroll them all here.. > > which then just generates a (very small) tree of if-statements. > > Again, this is entirely too ugly to do unless people *really* care. > But for situations where you have a small handful of cases known at > compile-time, it's not out of the question, and it probably does > generate better code. I'll update the patch description to point to the previous message just in case and apply it to sched_ext/for-6.11. Thanks.
On Thu, Jun 20, 2024 at 12:15:02PM -1000, Tejun Heo wrote: > scx_next_task_picked() is used by sched_ext to notify the BPF scheduler when > a CPU is taken away by a task dispatched from a higher priority sched_class > so that the BPF scheduler can, e.g., punt the task[s] which was running or > were waiting for the CPU to other CPUs. > > Replace the sched_ext specific hook scx_next_task_picked() with a new > sched_class operation switch_class(). > > The changes are straightforward and the code looks better afterwards. > However, when !CONFIG_SCHED_CLASS_EXT, this just ends up adding an unused > hook which is unlikely to be useful to other sched_classes. We can #ifdef > the op with CONFIG_SCHED_CLASS_EXT but then I'm not sure the code > necessarily looks better afterwards. > > Please let me know the preference. If adding #ifdef's is preferable, that's > okay too. > > Signed-off-by: Tejun Heo <tj@kernel.org> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Thomas Gleixner <tglx@linutronix.de> > --- > kernel/sched/core.c | 5 ++++- > kernel/sched/ext.c | 20 ++++++++++---------- > kernel/sched/ext.h | 4 ---- > kernel/sched/sched.h | 2 ++ > 4 files changed, 16 insertions(+), 15 deletions(-) > > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -5907,7 +5907,10 @@ restart: > for_each_active_class(class) { > p = class->pick_next_task(rq); > if (p) { > - scx_next_task_picked(rq, p, class); > + const struct sched_class *prev_class = prev->sched_class; > + > + if (class != prev_class && prev_class->switch_class) > + prev_class->switch_class(rq, p); > return p; > } > } I would much rather see sched_class::pick_next_task() get an extra argument so that the BPF thing can do what it needs in there and we can avoid this extra code here.
On Thu, Jun 20, 2024 at 03:42:48PM -0700, Linus Torvalds wrote: > Btw, indirect calls are now expensive enough that when you have only a > handful of choices, instead of a variable > > class->some_callback(some_arguments); > > you might literally be better off with a macro that does > > #define call_sched_fn(class, name, arg...) switch (class) { \ > case &fair_name_class: fair_name_class.name(arg); break; \ > ... unroll them all here.. > > which then just generates a (very small) tree of if-statements. > > Again, this is entirely too ugly to do unless people *really* care. > But for situations where you have a small handful of cases known at > compile-time, it's not out of the question, and it probably does > generate better code. > > NOTE NOTE NOTE! This is a comp[letely independent aside, and has > nothing to do with sched_ext except for the very obvious indirect fact > that sched_ext would be one of the classes in this kind of code. > > And yes, I suspect it is too ugly to actually do this. Very early on in the retpoline mess I briefly considered doing this, but I decided against doing the ugly until someone came with numbers bad enough to warrant them. We're now many years later and I'm very glad we never really *had* to go down that route.
On Fri, Jun 21, 2024 at 09:46:56AM -1000, Tejun Heo wrote: > Hello, > > On Thu, Jun 20, 2024 at 03:42:48PM -0700, Linus Torvalds wrote: > ... > > Btw, indirect calls are now expensive enough that when you have only a > > handful of choices, instead of a variable > > > > class->some_callback(some_arguments); > > > > you might literally be better off with a macro that does > > > > #define call_sched_fn(class, name, arg...) switch (class) { \ > > case &fair_name_class: fair_name_class.name(arg); break; \ > > ... unroll them all here.. > > > > which then just generates a (very small) tree of if-statements. > > > > Again, this is entirely too ugly to do unless people *really* care. > > But for situations where you have a small handful of cases known at > > compile-time, it's not out of the question, and it probably does > > generate better code. > > I'll update the patch description to point to the previous message just in > case and apply it to sched_ext/for-6.11. Can you please back merge and keep it a sane series? I'm going to have to review it (even though I still very strongly disagree with the whole thing) and there really is nothing worse than a series that introduces things only to remove/change them again later.
Hello, Peter. On Mon, Jun 24, 2024 at 11:04:31AM +0200, Peter Zijlstra wrote: > > I'll update the patch description to point to the previous message just in > > case and apply it to sched_ext/for-6.11. > > Can you please back merge and keep it a sane series? I'm going to have > to review it (even though I still very strongly disagree with the whole > thing) and there really is nothing worse than a series that introduces > things only to remove/change them again later. I started the sched_ext/for-6.11 branch last week with the v7 patchset: http://lkml.kernel.org/r/20240618212056.2833381-1-tj@kernel.org and would much prefer to just run it as a normal branch. No matter what form the patchset gets applied in, we'll have constant flow of changes, fixes, reverts and improvements no matter what, so I'd much prefer to avoid another round of reshuffling and reposting. Thanks.
Hello, Peter. On Mon, Jun 24, 2024 at 10:59:27AM +0200, Peter Zijlstra wrote: > > @@ -5907,7 +5907,10 @@ restart: > > for_each_active_class(class) { > > p = class->pick_next_task(rq); > > if (p) { > > - scx_next_task_picked(rq, p, class); > > + const struct sched_class *prev_class = prev->sched_class; > > + > > + if (class != prev_class && prev_class->switch_class) > > + prev_class->switch_class(rq, p); > > I would much rather see sched_class::pick_next_task() get an extra > argument so that the BPF thing can do what it needs in there and we can > avoid this extra code here. Hmm... but here, the previous class's ->pick_next_task() might not be called at all, so I'm not sure how that'd work. For context, sched_ext is using this to tell the BPF scheduler that it lost a CPU to a higher priority class (be that RT or CFS) os that the BPF scheduler can respond if necessary (e.g. punting tasks that were queued on that CPU somewhere else and so on). Imagine a case where a sched_ext task was running but then a RT task wakes up on the CPU. We'd enter the scheduling path, RT's pick_next_task() would return the new RT task to run. We now need to tell the BPF scheduler that we lost the CPU to the RT task but haven't called its pick_next_task() yet. Thanks.
On Mon, Jun 24, 2024 at 11:01:10AM -1000, Tejun Heo wrote: > Hello, Peter. > > On Mon, Jun 24, 2024 at 10:59:27AM +0200, Peter Zijlstra wrote: > > > @@ -5907,7 +5907,10 @@ restart: > > > for_each_active_class(class) { > > > p = class->pick_next_task(rq); > > > if (p) { > > > - scx_next_task_picked(rq, p, class); > > > + const struct sched_class *prev_class = prev->sched_class; > > > + > > > + if (class != prev_class && prev_class->switch_class) > > > + prev_class->switch_class(rq, p); > > > > I would much rather see sched_class::pick_next_task() get an extra > > argument so that the BPF thing can do what it needs in there and we can > > avoid this extra code here. > > Hmm... but here, the previous class's ->pick_next_task() might not be called > at all, so I'm not sure how that'd work. For context, sched_ext is using > this to tell the BPF scheduler that it lost a CPU to a higher priority class > (be that RT or CFS) os that the BPF scheduler can respond if necessary (e.g. > punting tasks that were queued on that CPU somewhere else and so on). > > Imagine a case where a sched_ext task was running but then a RT task wakes > up on the CPU. We'd enter the scheduling path, RT's pick_next_task() would > return the new RT task to run. We now need to tell the BPF scheduler that we > lost the CPU to the RT task but haven't called its pick_next_task() yet. Bah, I got it backwards indeed. But in this case, don't you also need something in pick_task() -- the whole core scheduling thing does much the same.
Hello, On Tue, Jun 25, 2024 at 09:49:35AM +0200, Peter Zijlstra wrote: > > Imagine a case where a sched_ext task was running but then a RT task wakes > > up on the CPU. We'd enter the scheduling path, RT's pick_next_task() would > > return the new RT task to run. We now need to tell the BPF scheduler that we > > lost the CPU to the RT task but haven't called its pick_next_task() yet. > > Bah, I got it backwards indeed. But in this case, don't you also need > something in pick_task() -- the whole core scheduling thing does much > the same. Yes, indeed we do, but because we're dispatching from the balance path, the cpu_acquire method is being called from there. Because who was running on the CPU before us is less interesting, @prev is not passed into cpu_acquire() but if that becomes necessary, it's already available there too. Thanks.
On Tue, Jun 25, 2024 at 01:30:05PM -1000, Tejun Heo wrote: > Hello, > > On Tue, Jun 25, 2024 at 09:49:35AM +0200, Peter Zijlstra wrote: > > > Imagine a case where a sched_ext task was running but then a RT task wakes > > > up on the CPU. We'd enter the scheduling path, RT's pick_next_task() would > > > return the new RT task to run. We now need to tell the BPF scheduler that we > > > lost the CPU to the RT task but haven't called its pick_next_task() yet. > > > > Bah, I got it backwards indeed. But in this case, don't you also need > > something in pick_task() -- the whole core scheduling thing does much > > the same. > > Yes, indeed we do, but because we're dispatching from the balance path, the > cpu_acquire method is being called from there. Because who was running on > the CPU before us is less interesting, @prev is not passed into > cpu_acquire() but if that becomes necessary, it's already available there > too. I suppose I need to read more, because I'm not knowing what cpu_acquire is :/ I do know I don't much like the asymmetry here, but maybe it makes sense, dunno.
Hello, On Wed, Jun 26, 2024 at 10:28:48AM +0200, Peter Zijlstra wrote: > I suppose I need to read more, because I'm not knowing what cpu_acquire > is :/ I do know I don't much like the asymmetry here, but maybe it makes > sense, dunno. The sched_ext ops are symmetric - ops.cpu_release() is called when SCX loses CPU to a higher priority sched class and ops.cpu_acquire() when the CPU returns to SCX afterwards. Where they hook into is not symmetric. The class which picks the next task already knows the previous task, so there's no need to add anything. However, without the new sched_class->switch_class(), the previous class has no way of knowing, so they're a bit different. Thanks.
--- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -5907,7 +5907,10 @@ restart: for_each_active_class(class) { p = class->pick_next_task(rq); if (p) { - scx_next_task_picked(rq, p, class); + const struct sched_class *prev_class = prev->sched_class; + + if (class != prev_class && prev_class->switch_class) + prev_class->switch_class(rq, p); return p; } } --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -2749,10 +2749,9 @@ preempt_reason_from_class(const struct s return SCX_CPU_PREEMPT_UNKNOWN; } -void scx_next_task_picked(struct rq *rq, struct task_struct *p, - const struct sched_class *active) +static void switch_class_scx(struct rq *rq, struct task_struct *next) { - lockdep_assert_rq_held(rq); + const struct sched_class *next_class = next->sched_class; if (!scx_enabled()) return; @@ -2769,12 +2768,11 @@ void scx_next_task_picked(struct rq *rq, /* * The callback is conceptually meant to convey that the CPU is no - * longer under the control of SCX. Therefore, don't invoke the - * callback if the CPU is is staying on SCX, or going idle (in which - * case the SCX scheduler has actively decided not to schedule any - * tasks on the CPU). + * longer under the control of SCX. Therefore, don't invoke the callback + * if the next class is below SCX (in which case the BPF scheduler has + * actively decided not to schedule any tasks on the CPU). */ - if (likely(active >= &ext_sched_class)) + if (sched_class_above(&ext_sched_class, next_class)) return; /* @@ -2789,8 +2787,8 @@ void scx_next_task_picked(struct rq *rq, if (!rq->scx.cpu_released) { if (SCX_HAS_OP(cpu_release)) { struct scx_cpu_release_args args = { - .reason = preempt_reason_from_class(active), - .task = p, + .reason = preempt_reason_from_class(next_class), + .task = next, }; SCX_CALL_OP(SCX_KF_CPU_RELEASE, @@ -3496,6 +3494,8 @@ DEFINE_SCHED_CLASS(ext) = { .put_prev_task = put_prev_task_scx, .set_next_task = set_next_task_scx, + .switch_class = switch_class_scx, + #ifdef CONFIG_SMP .balance = balance_scx, .select_task_rq = select_task_rq_scx, --- a/kernel/sched/ext.h +++ b/kernel/sched/ext.h @@ -33,8 +33,6 @@ static inline bool task_on_scx(const str return scx_enabled() && p->sched_class == &ext_sched_class; } -void scx_next_task_picked(struct rq *rq, struct task_struct *p, - const struct sched_class *active); void scx_tick(struct rq *rq); void init_scx_entity(struct sched_ext_entity *scx); void scx_pre_fork(struct task_struct *p); @@ -82,8 +80,6 @@ bool scx_prio_less(const struct task_str #define scx_enabled() false #define scx_switched_all() false -static inline void scx_next_task_picked(struct rq *rq, struct task_struct *p, - const struct sched_class *active) {} static inline void scx_tick(struct rq *rq) {} static inline void scx_pre_fork(struct task_struct *p) {} static inline int scx_fork(struct task_struct *p) { return 0; } --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -2355,6 +2355,8 @@ struct sched_class { void (*put_prev_task)(struct rq *rq, struct task_struct *p); void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first); + void (*switch_class)(struct rq *rq, struct task_struct *next); + #ifdef CONFIG_SMP int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf); int (*select_task_rq)(struct task_struct *p, int task_cpu, int flags);
scx_next_task_picked() is used by sched_ext to notify the BPF scheduler when a CPU is taken away by a task dispatched from a higher priority sched_class so that the BPF scheduler can, e.g., punt the task[s] which was running or were waiting for the CPU to other CPUs. Replace the sched_ext specific hook scx_next_task_picked() with a new sched_class operation switch_class(). The changes are straightforward and the code looks better afterwards. However, when !CONFIG_SCHED_CLASS_EXT, this just ends up adding an unused hook which is unlikely to be useful to other sched_classes. We can #ifdef the op with CONFIG_SCHED_CLASS_EXT but then I'm not sure the code necessarily looks better afterwards. Please let me know the preference. If adding #ifdef's is preferable, that's okay too. Signed-off-by: Tejun Heo <tj@kernel.org> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> --- kernel/sched/core.c | 5 ++++- kernel/sched/ext.c | 20 ++++++++++---------- kernel/sched/ext.h | 4 ---- kernel/sched/sched.h | 2 ++ 4 files changed, 16 insertions(+), 15 deletions(-)