Message ID | 20240619031250.2936087-3-tj@kernel.org (mailing list archive) |
---|---|
State | Handled Elsewhere, archived |
Headers | show |
Series | [1/2] cpufreq_schedutil: Refactor sugov_cpu_is_busy() | expand |
On 6/19/24 04:12, Tejun Heo wrote: > sched_ext currently does not integrate with schedutil. When schedutil is the > governor, frequencies are left unregulated and usually get stuck close to > the highest performance level from running RT tasks. > > Add CPU performance monitoring and scaling support by integrating into > schedutil. The following kfuncs are added: > > - scx_bpf_cpuperf_cap(): Query the relative performance capacity of > different CPUs in the system. > > - scx_bpf_cpuperf_cur(): Query the current performance level of a CPU > relative to its max performance. > > - scx_bpf_cpuperf_set(): Set the current target performance level of a CPU. > > This gives direct control over CPU performance setting to the BPF scheduler. > The only changes on the schedutil side are accounting for the utilization > factor from sched_ext and disabling frequency holding heuristics as it may > not apply well to sched_ext schedulers which may have a lot weaker > connection between tasks and their current / last CPU. > > With cpuperf support added, there is no reason to block uclamp. Enable while > at it. > > A toy implementation of cpuperf is added to scx_qmap as a demonstration of > the feature. > > Signed-off-by: Tejun Heo <tj@kernel.org> > Reviewed-by: David Vernet <dvernet@meta.com> > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > Cc: Viresh Kumar <viresh.kumar@linaro.org> > --- > kernel/sched/cpufreq_schedutil.c | 12 +- > kernel/sched/ext.c | 83 ++++++++++++- > kernel/sched/ext.h | 9 ++ > kernel/sched/sched.h | 1 + > tools/sched_ext/include/scx/common.bpf.h | 3 + > tools/sched_ext/scx_qmap.bpf.c | 142 ++++++++++++++++++++++- > tools/sched_ext/scx_qmap.c | 8 ++ > 7 files changed, 252 insertions(+), 6 deletions(-) > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > index 972b7dd65af2..12174c0137a5 100644 > --- a/kernel/sched/cpufreq_schedutil.c > +++ b/kernel/sched/cpufreq_schedutil.c > @@ -197,7 +197,9 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual, > > static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost) > { > - unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu); > + unsigned long min, max; > + unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu) + > + scx_cpuperf_target(sg_cpu->cpu); What does cpu_util_cfs_boost() contain if scx is active? NIT: reverse xmas > > util = effective_cpu_util(sg_cpu->cpu, util, &min, &max); > util = max(util, boost); > @@ -330,6 +332,14 @@ static bool sugov_hold_freq(struct sugov_cpu *sg_cpu) > unsigned long idle_calls; > bool ret; > > + /* > + * The heuristics in this function is for the fair class. For SCX, the > + * performance target comes directly from the BPF scheduler. Let's just > + * follow it. > + */ > + if (scx_switched_all()) > + return false; > + > [SNIP]
Hello, On Wed, Jun 19, 2024 at 03:07:51PM +0100, Christian Loehle wrote: > > static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost) > > { > > - unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu); > > + unsigned long min, max; > > + unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu) + > > + scx_cpuperf_target(sg_cpu->cpu); > > What does cpu_util_cfs_boost() contain if scx is active? Hmm... when SCX is in partial mode, cpu_util_cfs_boost() reports the util metric from cfs side as usual, but, yeah, when scx_switched_all(), this could be stale. I'll update. > NIT: reverse xmas Ah, let me just make the initial assignment a separate statement. Reverse xmas is awkward with broken lines. Thanks.
On 19/06/2024 04:12, Tejun Heo wrote: > sched_ext currently does not integrate with schedutil. When schedutil is the > governor, frequencies are left unregulated and usually get stuck close to > the highest performance level from running RT tasks. > > Add CPU performance monitoring and scaling support by integrating into > schedutil. The following kfuncs are added: > > - scx_bpf_cpuperf_cap(): Query the relative performance capacity of > different CPUs in the system. > > - scx_bpf_cpuperf_cur(): Query the current performance level of a CPU > relative to its max performance. > > - scx_bpf_cpuperf_set(): Set the current target performance level of a CPU. > > This gives direct control over CPU performance setting to the BPF scheduler. > The only changes on the schedutil side are accounting for the utilization > factor from sched_ext and disabling frequency holding heuristics as it may > not apply well to sched_ext schedulers which may have a lot weaker > connection between tasks and their current / last CPU. > > With cpuperf support added, there is no reason to block uclamp. Enable while > at it. > > A toy implementation of cpuperf is added to scx_qmap as a demonstration of > the feature. > > Signed-off-by: Tejun Heo <tj@kernel.org> > Reviewed-by: David Vernet <dvernet@meta.com> > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > Cc: Viresh Kumar <viresh.kumar@linaro.org> > --- > kernel/sched/cpufreq_schedutil.c | 12 +- > kernel/sched/ext.c | 83 ++++++++++++- > kernel/sched/ext.h | 9 ++ > kernel/sched/sched.h | 1 + > tools/sched_ext/include/scx/common.bpf.h | 3 + > tools/sched_ext/scx_qmap.bpf.c | 142 ++++++++++++++++++++++- > tools/sched_ext/scx_qmap.c | 8 ++ > 7 files changed, 252 insertions(+), 6 deletions(-) > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > index 972b7dd65af2..12174c0137a5 100644 > --- a/kernel/sched/cpufreq_schedutil.c > +++ b/kernel/sched/cpufreq_schedutil.c > @@ -197,7 +197,9 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual, > > static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost) > { > - unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu); > + unsigned long min, max; > + unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu) + > + scx_cpuperf_target(sg_cpu->cpu); > > util = effective_cpu_util(sg_cpu->cpu, util, &min, &max); > util = max(util, boost); > @@ -330,6 +332,14 @@ static bool sugov_hold_freq(struct sugov_cpu *sg_cpu) > unsigned long idle_calls; > bool ret; > > + /* > + * The heuristics in this function is for the fair class. For SCX, the > + * performance target comes directly from the BPF scheduler. Let's just > + * follow it. > + */ > + if (scx_switched_all()) > + return false; > + > /* if capped by uclamp_max, always update to be in compliance */ > if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu))) > return false; > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index f814e84ceeb3..04fb0eeee5ec 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -16,6 +16,8 @@ enum scx_consts { > SCX_EXIT_BT_LEN = 64, > SCX_EXIT_MSG_LEN = 1024, > SCX_EXIT_DUMP_DFL_LEN = 32768, > + > + SCX_CPUPERF_ONE = SCHED_CAPACITY_SCALE, > }; > > enum scx_exit_kind { > @@ -3520,7 +3522,7 @@ DEFINE_SCHED_CLASS(ext) = { > .update_curr = update_curr_scx, > > #ifdef CONFIG_UCLAMP_TASK > - .uclamp_enabled = 0, > + .uclamp_enabled = 1, > #endif > }; > Hi. I know this is a bit late, but the implication of this change here can be quite interesting. With this patch but without switching this knob from 0 to 1, this series gives me the perfect opportunity to implement a custom uclamp within sched_ext on top of the cpufreq support added. I think this would be what some vendors looking at sched_ext would also want. But, if .uclamp_enabled == 1, then the mainline uclamp implementation is in effect regardless of what ext scheduler is loaded. In fact, uclamp_{inc,dec}() are before calling the {enqueue,dequeue}_task() so now there's no easy way to circumvent it. What would be really nice is to have cpufreq support in sched_ext but not force uclamp_enabled. But, I also think there will be people who are happy with the current uclamp implementation and want to just reuse it. The best thing is to let the loaded scheduler decide, somehow, which I don't know if there's an easy way to do this yet. > [...]
Hello, Hongyan. On Tue, Jul 02, 2024 at 11:23:58AM +0100, Hongyan Xia wrote: > What would be really nice is to have cpufreq support in sched_ext but not > force uclamp_enabled. But, I also think there will be people who are happy > with the current uclamp implementation and want to just reuse it. The best > thing is to let the loaded scheduler decide, somehow, which I don't know if > there's an easy way to do this yet. I don't know much about uclamp but at least from sched_ext side, it's trivial add an ops flag for it and because we know that no tasks are on the ext class before BPF scheduler is loaded, as long as we switch the uclamp_enabled value while the BPF scheduler is not loaded, the uclamp buckets should stay balanced. AFAICS, the only core change we need to make is mooving the uclamp_enabled bool outside sched_class so that it can be changed runtime. Is that the case or am I missing something? Thanks.
On 02/07/2024 17:37, Tejun Heo wrote: > Hello, Hongyan. > > On Tue, Jul 02, 2024 at 11:23:58AM +0100, Hongyan Xia wrote: >> What would be really nice is to have cpufreq support in sched_ext but not >> force uclamp_enabled. But, I also think there will be people who are happy >> with the current uclamp implementation and want to just reuse it. The best >> thing is to let the loaded scheduler decide, somehow, which I don't know if >> there's an easy way to do this yet. > > I don't know much about uclamp but at least from sched_ext side, it's > trivial add an ops flag for it and because we know that no tasks are on the > ext class before BPF scheduler is loaded, as long as we switch the > uclamp_enabled value while the BPF scheduler is not loaded, the uclamp > buckets should stay balanced. AFAICS, the only core change we need to make > is mooving the uclamp_enabled bool outside sched_class so that it can be > changed runtime. Is that the case or am I missing something? > Pretty much. Just to clarify what I meant, it would be fantastic if for ext, sched_class->uclamp_enabled is decided the moment we load the custom scheduler, not globally enabled all the time for all ext schedulers, in case the custom scheduler wants to ignore uclamp or has its own uclamp implementation. During ext_ops->init(), it would be great if the loaded scheduler could decide whether its sched_class->uclamp_enabled should be enabled. However, sched_class->uclamp_enabled is just a normal struct variable, so I cannot immediately see a clean way to let the loaded scheduler program this field. We might be able to expose a function from the kernel side to write sched_class->uclamp_enabled during ext_ops->init(), although that looks a bit messy.
Hello, So, maybe something like this. It's not the prettiest but avoids adding indirect calls to fair and rt while allowing sched_ext to report what the BPF scheduler wants. Only compile tested. Would something like this work for the use cases you have on mind? Thanks. Index: work/kernel/sched/core.c =================================================================== --- work.orig/kernel/sched/core.c +++ work/kernel/sched/core.c @@ -1671,6 +1671,20 @@ static inline void uclamp_rq_dec_id(stru } } +bool sched_uclamp_enabled(void) +{ + return true; +} + +static bool class_supports_uclamp(const struct sched_class *class) +{ + if (likely(class->uclamp_enabled == sched_uclamp_enabled)) + return true; + if (!class->uclamp_enabled) + return false; + return class->uclamp_enabled(); +} + static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) { enum uclamp_id clamp_id; @@ -1684,7 +1698,7 @@ static inline void uclamp_rq_inc(struct if (!static_branch_unlikely(&sched_uclamp_used)) return; - if (unlikely(!p->sched_class->uclamp_enabled)) + if (class_supports_uclamp(p->sched_class)) return; for_each_clamp_id(clamp_id) @@ -1708,7 +1722,7 @@ static inline void uclamp_rq_dec(struct if (!static_branch_unlikely(&sched_uclamp_used)) return; - if (unlikely(!p->sched_class->uclamp_enabled)) + if (class_supports_uclamp(p->sched_class)) return; for_each_clamp_id(clamp_id) Index: work/kernel/sched/ext.c =================================================================== --- work.orig/kernel/sched/ext.c +++ work/kernel/sched/ext.c @@ -116,10 +116,17 @@ enum scx_ops_flags { */ SCX_OPS_SWITCH_PARTIAL = 1LLU << 3, + /* + * Disable built-in uclamp support. Can be useful when the BPF scheduler + * wants to implement custom uclamp support. + */ + SCX_OPS_DISABLE_UCLAMP = 1LLU << 4, + SCX_OPS_ALL_FLAGS = SCX_OPS_KEEP_BUILTIN_IDLE | SCX_OPS_ENQ_LAST | SCX_OPS_ENQ_EXITING | - SCX_OPS_SWITCH_PARTIAL, + SCX_OPS_SWITCH_PARTIAL | + SCX_OPS_DISABLE_UCLAMP, }; /* argument container for ops.init_task() */ @@ -3437,6 +3444,13 @@ static void switched_from_scx(struct rq static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {} static void switched_to_scx(struct rq *rq, struct task_struct *p) {} +#ifdef CONFIG_UCLAMP_TASK +static bool uclamp_enabled_scx(void) +{ + return !(scx_ops.flags & SCX_OPS_DISABLE_UCLAMP); +} +#endif + int scx_check_setscheduler(struct task_struct *p, int policy) { lockdep_assert_rq_held(task_rq(p)); @@ -3522,7 +3536,7 @@ DEFINE_SCHED_CLASS(ext) = { .update_curr = update_curr_scx, #ifdef CONFIG_UCLAMP_TASK - .uclamp_enabled = 1, + .uclamp_enabled = uclamp_enabled_scx, #endif }; Index: work/kernel/sched/fair.c =================================================================== --- work.orig/kernel/sched/fair.c +++ work/kernel/sched/fair.c @@ -13228,9 +13228,7 @@ DEFINE_SCHED_CLASS(fair) = { .task_is_throttled = task_is_throttled_fair, #endif -#ifdef CONFIG_UCLAMP_TASK - .uclamp_enabled = 1, -#endif + SCHED_CLASS_UCLAMP_ENABLED }; #ifdef CONFIG_SCHED_DEBUG Index: work/kernel/sched/rt.c =================================================================== --- work.orig/kernel/sched/rt.c +++ work/kernel/sched/rt.c @@ -2681,9 +2681,7 @@ DEFINE_SCHED_CLASS(rt) = { .task_is_throttled = task_is_throttled_rt, #endif -#ifdef CONFIG_UCLAMP_TASK - .uclamp_enabled = 1, -#endif + SCHED_CLASS_UCLAMP_ENABLED }; #ifdef CONFIG_RT_GROUP_SCHED Index: work/kernel/sched/sched.h =================================================================== --- work.orig/kernel/sched/sched.h +++ work/kernel/sched/sched.h @@ -2339,11 +2339,6 @@ struct affinity_context { extern s64 update_curr_common(struct rq *rq); struct sched_class { - -#ifdef CONFIG_UCLAMP_TASK - int uclamp_enabled; -#endif - void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags); void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags); void (*yield_task) (struct rq *rq); @@ -2405,8 +2400,21 @@ struct sched_class { #ifdef CONFIG_SCHED_CORE int (*task_is_throttled)(struct task_struct *p, int cpu); #endif + +#ifdef CONFIG_UCLAMP_TASK + bool (*uclamp_enabled)(void); +#endif }; +#ifdef CONFIG_UCLAMP_TASK +bool sched_uclamp_enabled(void); + +#define SCHED_CLASS_UCLAMP_ENABLED \ + .uclamp_enabled = sched_uclamp_enabled, +#else +#define SCHED_CLASS_UCLAMP_ENABLED +#endif + static inline void put_prev_task(struct rq *rq, struct task_struct *prev) { WARN_ON_ONCE(rq->curr != prev);
On 02/07/2024 18:56, Tejun Heo wrote: > Hello, > > So, maybe something like this. It's not the prettiest but avoids adding > indirect calls to fair and rt while allowing sched_ext to report what the > BPF scheduler wants. Only compile tested. Would something like this work for > the use cases you have on mind? > > Thanks. > > Index: work/kernel/sched/core.c > =================================================================== > --- work.orig/kernel/sched/core.c > +++ work/kernel/sched/core.c > @@ -1671,6 +1671,20 @@ static inline void uclamp_rq_dec_id(stru > } > } > > +bool sched_uclamp_enabled(void) > +{ > + return true; > +} > + > +static bool class_supports_uclamp(const struct sched_class *class) > +{ > + if (likely(class->uclamp_enabled == sched_uclamp_enabled)) > + return true; > + if (!class->uclamp_enabled) > + return false; > + return class->uclamp_enabled(); > +} > + > static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p) > { > enum uclamp_id clamp_id; > @@ -1684,7 +1698,7 @@ static inline void uclamp_rq_inc(struct > if (!static_branch_unlikely(&sched_uclamp_used)) > return; > > - if (unlikely(!p->sched_class->uclamp_enabled)) > + if (class_supports_uclamp(p->sched_class)) > return; > > for_each_clamp_id(clamp_id) > @@ -1708,7 +1722,7 @@ static inline void uclamp_rq_dec(struct > if (!static_branch_unlikely(&sched_uclamp_used)) > return; > > - if (unlikely(!p->sched_class->uclamp_enabled)) > + if (class_supports_uclamp(p->sched_class)) > return; > > for_each_clamp_id(clamp_id) > Index: work/kernel/sched/ext.c > =================================================================== > --- work.orig/kernel/sched/ext.c > +++ work/kernel/sched/ext.c > @@ -116,10 +116,17 @@ enum scx_ops_flags { > */ > SCX_OPS_SWITCH_PARTIAL = 1LLU << 3, > > + /* > + * Disable built-in uclamp support. Can be useful when the BPF scheduler > + * wants to implement custom uclamp support. > + */ > + SCX_OPS_DISABLE_UCLAMP = 1LLU << 4, > + > SCX_OPS_ALL_FLAGS = SCX_OPS_KEEP_BUILTIN_IDLE | > SCX_OPS_ENQ_LAST | > SCX_OPS_ENQ_EXITING | > - SCX_OPS_SWITCH_PARTIAL, > + SCX_OPS_SWITCH_PARTIAL | > + SCX_OPS_DISABLE_UCLAMP, > }; > > /* argument container for ops.init_task() */ > @@ -3437,6 +3444,13 @@ static void switched_from_scx(struct rq > static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {} > static void switched_to_scx(struct rq *rq, struct task_struct *p) {} > > +#ifdef CONFIG_UCLAMP_TASK > +static bool uclamp_enabled_scx(void) > +{ > + return !(scx_ops.flags & SCX_OPS_DISABLE_UCLAMP); > +} > +#endif > + > int scx_check_setscheduler(struct task_struct *p, int policy) > { > lockdep_assert_rq_held(task_rq(p)); > @@ -3522,7 +3536,7 @@ DEFINE_SCHED_CLASS(ext) = { > .update_curr = update_curr_scx, > > #ifdef CONFIG_UCLAMP_TASK > - .uclamp_enabled = 1, > + .uclamp_enabled = uclamp_enabled_scx, > #endif > }; > > Index: work/kernel/sched/fair.c > =================================================================== > --- work.orig/kernel/sched/fair.c > +++ work/kernel/sched/fair.c > @@ -13228,9 +13228,7 @@ DEFINE_SCHED_CLASS(fair) = { > .task_is_throttled = task_is_throttled_fair, > #endif > > -#ifdef CONFIG_UCLAMP_TASK > - .uclamp_enabled = 1, > -#endif > + SCHED_CLASS_UCLAMP_ENABLED > }; > > #ifdef CONFIG_SCHED_DEBUG > Index: work/kernel/sched/rt.c > =================================================================== > --- work.orig/kernel/sched/rt.c > +++ work/kernel/sched/rt.c > @@ -2681,9 +2681,7 @@ DEFINE_SCHED_CLASS(rt) = { > .task_is_throttled = task_is_throttled_rt, > #endif > > -#ifdef CONFIG_UCLAMP_TASK > - .uclamp_enabled = 1, > -#endif > + SCHED_CLASS_UCLAMP_ENABLED > }; > > #ifdef CONFIG_RT_GROUP_SCHED > Index: work/kernel/sched/sched.h > =================================================================== > --- work.orig/kernel/sched/sched.h > +++ work/kernel/sched/sched.h > @@ -2339,11 +2339,6 @@ struct affinity_context { > extern s64 update_curr_common(struct rq *rq); > > struct sched_class { > - > -#ifdef CONFIG_UCLAMP_TASK > - int uclamp_enabled; > -#endif > - > void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags); > void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags); > void (*yield_task) (struct rq *rq); > @@ -2405,8 +2400,21 @@ struct sched_class { > #ifdef CONFIG_SCHED_CORE > int (*task_is_throttled)(struct task_struct *p, int cpu); > #endif > + > +#ifdef CONFIG_UCLAMP_TASK > + bool (*uclamp_enabled)(void); > +#endif > }; > > +#ifdef CONFIG_UCLAMP_TASK > +bool sched_uclamp_enabled(void); > + > +#define SCHED_CLASS_UCLAMP_ENABLED \ > + .uclamp_enabled = sched_uclamp_enabled, > +#else > +#define SCHED_CLASS_UCLAMP_ENABLED > +#endif > + > static inline void put_prev_task(struct rq *rq, struct task_struct *prev) > { > WARN_ON_ONCE(rq->curr != prev); Looks good to me! Actually, if we are okay with changing the sched_class struct and touching the code of other classes, I wonder if a cleaner solution is just to completely remove sched_class->uclamp_enabled and let each class decide what to do in enqueue and dequeue, so instead of uclamp_inc/dec(); p->sched_class->enqueue/dequeue_task(); we can just p->sched_class->enqueue/dequeue_task(); do_uclamp_inside_each_class(); and we export uclamp_inc/dec() functions from core.c to RT, fair and ext. For RT and fair, just enqueue/dequeue_task_fair/rt(); uclamp_inc/dec(); For ext, maybe expose bpf_uclamp_inc/dec() to the custom scheduler. If a scheduler wants to reuse the current uclamp implementation, just call these. If not, skip them and implement its own.
Hello, On Tue, Jul 02, 2024 at 09:41:30PM +0100, Hongyan Xia wrote: ... > Actually, if we are okay with changing the sched_class struct and touching > the code of other classes, I wonder if a cleaner solution is just to > completely remove sched_class->uclamp_enabled and let each class decide what > to do in enqueue and dequeue, so instead of > > uclamp_inc/dec(); > p->sched_class->enqueue/dequeue_task(); > > we can just > > p->sched_class->enqueue/dequeue_task(); > do_uclamp_inside_each_class(); > > and we export uclamp_inc/dec() functions from core.c to RT, fair and ext. > For RT and fair, just > > enqueue/dequeue_task_fair/rt(); > uclamp_inc/dec(); > > For ext, maybe expose bpf_uclamp_inc/dec() to the custom scheduler. If a > scheduler wants to reuse the current uclamp implementation, just call these. > If not, skip them and implement its own. That does sound a lot better. Mind writing up a patchset? Thanks.
On 06/18/24 17:12, Tejun Heo wrote: > sched_ext currently does not integrate with schedutil. When schedutil is the > governor, frequencies are left unregulated and usually get stuck close to > the highest performance level from running RT tasks. Have you tried to investigate why is that? By default RT run at max frequency. Only way to prevent them from doing that is by using uclamp https://kernel.org/doc/html/latest/scheduler/sched-util-clamp.html#sched-util-clamp-min-rt-default If that's not the cause, then it's likely something else is broken. > > Add CPU performance monitoring and scaling support by integrating into > schedutil. The following kfuncs are added: > > - scx_bpf_cpuperf_cap(): Query the relative performance capacity of > different CPUs in the system. > > - scx_bpf_cpuperf_cur(): Query the current performance level of a CPU > relative to its max performance. > > - scx_bpf_cpuperf_set(): Set the current target performance level of a CPU. What is exactly the problem you're seeing? You shouldn't need to set performance directly. Are you trying to fix a problem, or add a new feature? > > This gives direct control over CPU performance setting to the BPF scheduler. Why would we need to do that? schedutil is supposed to operate in utilization signal. Overriding it with custom unknown changes makes it all random governor based on what's current bpf sched_ext is loaded? This make bug reports and debugging problems a lot harder. I do hope by the way that loading external scheduler does cause the kernel to be tainted. With these random changes, it's hard to know if it is a problem in the kernel or with external out of tree entity. Out of tree modules taint the kernel, so should loading sched_ext. It should not cause spurious reports, nor prevent us from changing the code without worrying about breaking out of tree code. > The only changes on the schedutil side are accounting for the utilization > factor from sched_ext and disabling frequency holding heuristics as it may > not apply well to sched_ext schedulers which may have a lot weaker > connection between tasks and their current / last CPU. > > With cpuperf support added, there is no reason to block uclamp. Enable while > at it. > > A toy implementation of cpuperf is added to scx_qmap as a demonstration of > the feature. > > Signed-off-by: Tejun Heo <tj@kernel.org> > Reviewed-by: David Vernet <dvernet@meta.com> > Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> > Cc: Viresh Kumar <viresh.kumar@linaro.org> > --- > kernel/sched/cpufreq_schedutil.c | 12 +- > kernel/sched/ext.c | 83 ++++++++++++- > kernel/sched/ext.h | 9 ++ > kernel/sched/sched.h | 1 + > tools/sched_ext/include/scx/common.bpf.h | 3 + > tools/sched_ext/scx_qmap.bpf.c | 142 ++++++++++++++++++++++- > tools/sched_ext/scx_qmap.c | 8 ++ > 7 files changed, 252 insertions(+), 6 deletions(-) > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c > index 972b7dd65af2..12174c0137a5 100644 > --- a/kernel/sched/cpufreq_schedutil.c > +++ b/kernel/sched/cpufreq_schedutil.c > @@ -197,7 +197,9 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual, > > static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost) > { > - unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu); > + unsigned long min, max; > + unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu) + > + scx_cpuperf_target(sg_cpu->cpu); > > util = effective_cpu_util(sg_cpu->cpu, util, &min, &max); > util = max(util, boost); > @@ -330,6 +332,14 @@ static bool sugov_hold_freq(struct sugov_cpu *sg_cpu) > unsigned long idle_calls; > bool ret; > > + /* > + * The heuristics in this function is for the fair class. For SCX, the > + * performance target comes directly from the BPF scheduler. Let's just > + * follow it. > + */ > + if (scx_switched_all()) > + return false; Why do you need to totally override? What problems did you find in current util value and what have you done to try to fix it first rather than override it completely? > + > /* if capped by uclamp_max, always update to be in compliance */ > if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu))) > return false; > diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c > index f814e84ceeb3..04fb0eeee5ec 100644 > --- a/kernel/sched/ext.c > +++ b/kernel/sched/ext.c > @@ -16,6 +16,8 @@ enum scx_consts { > SCX_EXIT_BT_LEN = 64, > SCX_EXIT_MSG_LEN = 1024, > SCX_EXIT_DUMP_DFL_LEN = 32768, > + > + SCX_CPUPERF_ONE = SCHED_CAPACITY_SCALE, > }; > > enum scx_exit_kind { > @@ -3520,7 +3522,7 @@ DEFINE_SCHED_CLASS(ext) = { > .update_curr = update_curr_scx, > > #ifdef CONFIG_UCLAMP_TASK > - .uclamp_enabled = 0, > + .uclamp_enabled = 1, > #endif > }; > > @@ -4393,7 +4395,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link) > struct scx_task_iter sti; > struct task_struct *p; > unsigned long timeout; > - int i, ret; > + int i, cpu, ret; > > mutex_lock(&scx_ops_enable_mutex); > > @@ -4442,6 +4444,9 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link) > > atomic_long_set(&scx_nr_rejected, 0); > > + for_each_possible_cpu(cpu) > + cpu_rq(cpu)->scx.cpuperf_target = SCX_CPUPERF_ONE; > + > /* > * Keep CPUs stable during enable so that the BPF scheduler can track > * online CPUs by watching ->on/offline_cpu() after ->init(). > @@ -5835,6 +5840,77 @@ __bpf_kfunc void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, > ops_dump_flush(); > } > > +/** > + * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU > + * @cpu: CPU of interest > + * > + * Return the maximum relative capacity of @cpu in relation to the most > + * performant CPU in the system. The return value is in the range [1, > + * %SCX_CPUPERF_ONE]. See scx_bpf_cpuperf_cur(). > + */ > +__bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu) > +{ > + if (ops_cpu_valid(cpu, NULL)) > + return arch_scale_cpu_capacity(cpu); > + else > + return SCX_CPUPERF_ONE; > +} Hmm. This is tricky. It looks fine, but I worry about changing how we want to handle capacities in the future and then being tied down forever with out of tree sched_ext not being able to load. How are we going to protect against such potential changes? Just make it a NOP? A bit hypothetical but so far these are considered internal scheduler details that could change anytime with no consequence. With this attaching to this info changing them will become a lot harder as there's external dependencies that will fail to load or work properly. And what is the regression rule in this case? You should make all functions return an error to future proof them against suddenly disappearing. > + > +/** > + * scx_bpf_cpuperf_cur - Query the current relative performance of a CPU > + * @cpu: CPU of interest > + * > + * Return the current relative performance of @cpu in relation to its maximum. > + * The return value is in the range [1, %SCX_CPUPERF_ONE]. > + * > + * The current performance level of a CPU in relation to the maximum performance > + * available in the system can be calculated as follows: > + * > + * scx_bpf_cpuperf_cap() * scx_bpf_cpuperf_cur() / %SCX_CPUPERF_ONE > + * > + * The result is in the range [1, %SCX_CPUPERF_ONE]. > + */ > +__bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu) > +{ > + if (ops_cpu_valid(cpu, NULL)) > + return arch_scale_freq_capacity(cpu); > + else > + return SCX_CPUPERF_ONE; > +} > + > +/** > + * scx_bpf_cpuperf_set - Set the relative performance target of a CPU > + * @cpu: CPU of interest > + * @perf: target performance level [0, %SCX_CPUPERF_ONE] > + * @flags: %SCX_CPUPERF_* flags > + * > + * Set the target performance level of @cpu to @perf. @perf is in linear > + * relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the > + * schedutil cpufreq governor chooses the target frequency. > + * > + * The actual performance level chosen, CPU grouping, and the overhead and > + * latency of the operations are dependent on the hardware and cpufreq driver in > + * use. Consult hardware and cpufreq documentation for more information. The > + * current performance level can be monitored using scx_bpf_cpuperf_cur(). > + */ > +__bpf_kfunc void scx_bpf_cpuperf_set(u32 cpu, u32 perf) > +{ > + if (unlikely(perf > SCX_CPUPERF_ONE)) { > + scx_ops_error("Invalid cpuperf target %u for CPU %d", perf, cpu); > + return; > + } > + > + if (ops_cpu_valid(cpu, NULL)) { > + struct rq *rq = cpu_rq(cpu); > + > + rq->scx.cpuperf_target = perf; > + > + rcu_read_lock_sched_notrace(); > + cpufreq_update_util(cpu_rq(cpu), 0); > + rcu_read_unlock_sched_notrace(); > + } > +} Is the problem that you break how util signal works in sched_ext? Or you want the fine control? We expect user application to use uclamp to set their perf requirement. And sched_ext should not break util signal, no? If it does and there's a good reason for it, then it is not compatible with schedutil, as the name indicates it operates on util signal as defined in PELT. You can always use min_freq/max_freq in sysfs to force min and max frequencies without hacking the governor. I don't advise it though and I'd recommend trying to be compatible with schedutil as-is rather than modify it. Consistency is a key. Thanks -- Qais Yousef
Hello, Qais. On Thu, Jul 25, 2024 at 12:45:27AM +0100, Qais Yousef wrote: > On 06/18/24 17:12, Tejun Heo wrote: > > sched_ext currently does not integrate with schedutil. When schedutil is the > > governor, frequencies are left unregulated and usually get stuck close to > > the highest performance level from running RT tasks. > > Have you tried to investigate why is that? By default RT run at max frequency. > Only way to prevent them from doing that is by using uclamp > > https://kernel.org/doc/html/latest/scheduler/sched-util-clamp.html#sched-util-clamp-min-rt-default > > If that's not the cause, then it's likely something else is broken. Nothing is necessarily broken. SCX just wasn't providing any signal to schedutil before this patch, so schedutil ends up just going by with the occasional signals from RT. ... > What is exactly the problem you're seeing? You shouldn't need to set > performance directly. Are you trying to fix a problem, or add a new feature? I'm having a hard time following the question. When active, the BPF scheduler is the only one who can tell how much each CPU is being used, and the straightforward to integrate with schedutil is by indicating what the current CPU utilization level is which is the signal that schedutil takes from each scheduler class. > > This gives direct control over CPU performance setting to the BPF scheduler. > > Why would we need to do that? schedutil is supposed to operate in utilization Because nobody else knows? No one *can* know. Schedutil is driven by the utilization signal from the scheduler. Here, the scheduler is implemented in BPF. Nothing else knows how much workload each CPU is going to get. Note that the kernel side does not have any meaningful information re. task <-> CPU relationship. That can change on every dispatch. Only the BPF scheduler itself knows how the CPUs are going to be used. > signal. Overriding it with custom unknown changes makes it all random governor > based on what's current bpf sched_ext is loaded? This make bug reports and > debugging problems a lot harder. If schedutil is misbehaving while an SCX scheduler is loaded, it's the BPF scheduler's fault. There isn't much else to it. > I do hope by the way that loading external scheduler does cause the kernel to > be tainted. With these random changes, it's hard to know if it is a problem in > the kernel or with external out of tree entity. Out of tree modules taint the > kernel, so should loading sched_ext. Out of tree modules can cause corruptions which can have lasting impacts on the system even after the module is unloaded. That's why we keep the persistent taint flags - to remember that the current misbehavior has a reasonable chance of being impacted by what happened to the system beforehand. Kernel bugs aside, SCX schedulers shouldn't leave persistent impacts on the system. In those cases, we usually mark the dumps which SCX already does. > It should not cause spurious reports, nor prevent us from changing the code > without worrying about breaking out of tree code. Oh yeah, we can and will make compatibility breaking changes. Not willy-nilly, hopefully. ... > > + /* > > + * The heuristics in this function is for the fair class. For SCX, the > > + * performance target comes directly from the BPF scheduler. Let's just > > + * follow it. > > + */ > > + if (scx_switched_all()) > > + return false; > > Why do you need to totally override? What problems did you find in current util > value and what have you done to try to fix it first rather than override it > completely? Because it's way cleaner this way. Otherwise, we need to keep calling into update_blocked_averages() so that the fair class's util metrics can decay (it often doesn't decay completely due to math inaccuracies but Vincent was looking at it). When switched_all(), the fair class is not used at all and keeping calling it to decay the utility metrics to zero seems kinda silly. ... > > +/** > > + * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU > > + * @cpu: CPU of interest > > + * > > + * Return the maximum relative capacity of @cpu in relation to the most > > + * performant CPU in the system. The return value is in the range [1, > > + * %SCX_CPUPERF_ONE]. See scx_bpf_cpuperf_cur(). > > + */ > > +__bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu) > > +{ > > + if (ops_cpu_valid(cpu, NULL)) > > + return arch_scale_cpu_capacity(cpu); > > + else > > + return SCX_CPUPERF_ONE; > > +} > > Hmm. This is tricky. It looks fine, but I worry about changing how we want to > handle capacities in the future and then being tied down forever with out of > tree sched_ext not being able to load. > > How are we going to protect against such potential changes? Just make it a NOP? So, if it's a necessary change, we just break the API and update the out-of-tree schedulers accordingly. With BPF's CO-RE and supporting features, we can handle quite a bit of backward compatibility - ie. it's cumbersome but usually possible to provide BPF-side helpers that allow updated BPF schedulers to be loaded in both old and new kernels, so it usually isn't *that* painful. > A bit hypothetical but so far these are considered internal scheduler details > that could change anytime with no consequence. With this attaching to this info > changing them will become a lot harder as there's external dependencies that > will fail to load or work properly. And what is the regression rule in this > case? This is not different from any other BPF hooks and has been discussed many times. As I wrote before, we don't want to change for no reason but we definitely can change things if necessary. > You should make all functions return an error to future proof them against > suddenly disappearing. Really, no need to be that draconian. ... > > +__bpf_kfunc void scx_bpf_cpuperf_set(u32 cpu, u32 perf) > > +{ > > + if (unlikely(perf > SCX_CPUPERF_ONE)) { > > + scx_ops_error("Invalid cpuperf target %u for CPU %d", perf, cpu); > > + return; > > + } > > + > > + if (ops_cpu_valid(cpu, NULL)) { > > + struct rq *rq = cpu_rq(cpu); > > + > > + rq->scx.cpuperf_target = perf; > > + > > + rcu_read_lock_sched_notrace(); > > + cpufreq_update_util(cpu_rq(cpu), 0); > > + rcu_read_unlock_sched_notrace(); > > + } > > +} > > Is the problem that you break how util signal works in sched_ext? Or you want > the fine control? We expect user application to use uclamp to set their perf > requirement. And sched_ext should not break util signal, no? If it does and > there's a good reason for it, then it is not compatible with schedutil, as the > name indicates it operates on util signal as defined in PELT. > > You can always use min_freq/max_freq in sysfs to force min and max frequencies > without hacking the governor. I don't advise it though and I'd recommend trying > to be compatible with schedutil as-is rather than modify it. Consistency is > a key. It's not hacking the governor and uclamp should work the same way on top. The BPF scheduler is the only thing that can tell how and how much a given CPU is going to be used. There is no way for the kernel to project per-CPU utilization without asking the BPF scheduler. Thanks.
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 972b7dd65af2..12174c0137a5 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -197,7 +197,9 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual, static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost) { - unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu); + unsigned long min, max; + unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu) + + scx_cpuperf_target(sg_cpu->cpu); util = effective_cpu_util(sg_cpu->cpu, util, &min, &max); util = max(util, boost); @@ -330,6 +332,14 @@ static bool sugov_hold_freq(struct sugov_cpu *sg_cpu) unsigned long idle_calls; bool ret; + /* + * The heuristics in this function is for the fair class. For SCX, the + * performance target comes directly from the BPF scheduler. Let's just + * follow it. + */ + if (scx_switched_all()) + return false; + /* if capped by uclamp_max, always update to be in compliance */ if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu))) return false; diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c index f814e84ceeb3..04fb0eeee5ec 100644 --- a/kernel/sched/ext.c +++ b/kernel/sched/ext.c @@ -16,6 +16,8 @@ enum scx_consts { SCX_EXIT_BT_LEN = 64, SCX_EXIT_MSG_LEN = 1024, SCX_EXIT_DUMP_DFL_LEN = 32768, + + SCX_CPUPERF_ONE = SCHED_CAPACITY_SCALE, }; enum scx_exit_kind { @@ -3520,7 +3522,7 @@ DEFINE_SCHED_CLASS(ext) = { .update_curr = update_curr_scx, #ifdef CONFIG_UCLAMP_TASK - .uclamp_enabled = 0, + .uclamp_enabled = 1, #endif }; @@ -4393,7 +4395,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link) struct scx_task_iter sti; struct task_struct *p; unsigned long timeout; - int i, ret; + int i, cpu, ret; mutex_lock(&scx_ops_enable_mutex); @@ -4442,6 +4444,9 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link) atomic_long_set(&scx_nr_rejected, 0); + for_each_possible_cpu(cpu) + cpu_rq(cpu)->scx.cpuperf_target = SCX_CPUPERF_ONE; + /* * Keep CPUs stable during enable so that the BPF scheduler can track * online CPUs by watching ->on/offline_cpu() after ->init(). @@ -5835,6 +5840,77 @@ __bpf_kfunc void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, ops_dump_flush(); } +/** + * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU + * @cpu: CPU of interest + * + * Return the maximum relative capacity of @cpu in relation to the most + * performant CPU in the system. The return value is in the range [1, + * %SCX_CPUPERF_ONE]. See scx_bpf_cpuperf_cur(). + */ +__bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu) +{ + if (ops_cpu_valid(cpu, NULL)) + return arch_scale_cpu_capacity(cpu); + else + return SCX_CPUPERF_ONE; +} + +/** + * scx_bpf_cpuperf_cur - Query the current relative performance of a CPU + * @cpu: CPU of interest + * + * Return the current relative performance of @cpu in relation to its maximum. + * The return value is in the range [1, %SCX_CPUPERF_ONE]. + * + * The current performance level of a CPU in relation to the maximum performance + * available in the system can be calculated as follows: + * + * scx_bpf_cpuperf_cap() * scx_bpf_cpuperf_cur() / %SCX_CPUPERF_ONE + * + * The result is in the range [1, %SCX_CPUPERF_ONE]. + */ +__bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu) +{ + if (ops_cpu_valid(cpu, NULL)) + return arch_scale_freq_capacity(cpu); + else + return SCX_CPUPERF_ONE; +} + +/** + * scx_bpf_cpuperf_set - Set the relative performance target of a CPU + * @cpu: CPU of interest + * @perf: target performance level [0, %SCX_CPUPERF_ONE] + * @flags: %SCX_CPUPERF_* flags + * + * Set the target performance level of @cpu to @perf. @perf is in linear + * relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the + * schedutil cpufreq governor chooses the target frequency. + * + * The actual performance level chosen, CPU grouping, and the overhead and + * latency of the operations are dependent on the hardware and cpufreq driver in + * use. Consult hardware and cpufreq documentation for more information. The + * current performance level can be monitored using scx_bpf_cpuperf_cur(). + */ +__bpf_kfunc void scx_bpf_cpuperf_set(u32 cpu, u32 perf) +{ + if (unlikely(perf > SCX_CPUPERF_ONE)) { + scx_ops_error("Invalid cpuperf target %u for CPU %d", perf, cpu); + return; + } + + if (ops_cpu_valid(cpu, NULL)) { + struct rq *rq = cpu_rq(cpu); + + rq->scx.cpuperf_target = perf; + + rcu_read_lock_sched_notrace(); + cpufreq_update_util(cpu_rq(cpu), 0); + rcu_read_unlock_sched_notrace(); + } +} + /** * scx_bpf_nr_cpu_ids - Return the number of possible CPU IDs * @@ -6045,6 +6121,9 @@ BTF_ID_FLAGS(func, scx_bpf_destroy_dsq) BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_TRUSTED_ARGS) BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_TRUSTED_ARGS) BTF_ID_FLAGS(func, scx_bpf_dump_bstr, KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap) +BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur) +BTF_ID_FLAGS(func, scx_bpf_cpuperf_set) BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids) BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE) BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE) diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h index c41d742b5d62..c7ac33c47b68 100644 --- a/kernel/sched/ext.h +++ b/kernel/sched/ext.h @@ -48,6 +48,14 @@ int scx_check_setscheduler(struct task_struct *p, int policy); bool task_should_scx(struct task_struct *p); void init_sched_ext_class(void); +static inline u32 scx_cpuperf_target(s32 cpu) +{ + if (scx_enabled()) + return cpu_rq(cpu)->scx.cpuperf_target; + else + return 0; +} + static inline const struct sched_class *next_active_class(const struct sched_class *class) { class++; @@ -89,6 +97,7 @@ static inline void scx_pre_fork(struct task_struct *p) {} static inline int scx_fork(struct task_struct *p) { return 0; } static inline void scx_post_fork(struct task_struct *p) {} static inline void scx_cancel_fork(struct task_struct *p) {} +static inline u32 scx_cpuperf_target(s32 cpu) { return 0; } static inline bool scx_can_stop_tick(struct rq *rq) { return true; } static inline void scx_rq_activate(struct rq *rq) {} static inline void scx_rq_deactivate(struct rq *rq) {} diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index c0d6e42c99cc..d3912cf3c3b7 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -743,6 +743,7 @@ struct scx_rq { u64 extra_enq_flags; /* see move_task_to_local_dsq() */ u32 nr_running; u32 flags; + u32 cpuperf_target; /* [0, SCHED_CAPACITY_SCALE] */ bool cpu_released; cpumask_var_t cpus_to_kick; cpumask_var_t cpus_to_kick_if_idle; diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h index 3fa87084cf17..dbbda0e35c5d 100644 --- a/tools/sched_ext/include/scx/common.bpf.h +++ b/tools/sched_ext/include/scx/common.bpf.h @@ -42,6 +42,9 @@ void scx_bpf_destroy_dsq(u64 dsq_id) __ksym; void scx_bpf_exit_bstr(s64 exit_code, char *fmt, unsigned long long *data, u32 data__sz) __ksym __weak; void scx_bpf_error_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym; void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym __weak; +u32 scx_bpf_cpuperf_cap(s32 cpu) __ksym __weak; +u32 scx_bpf_cpuperf_cur(s32 cpu) __ksym __weak; +void scx_bpf_cpuperf_set(s32 cpu, u32 perf) __ksym __weak; u32 scx_bpf_nr_cpu_ids(void) __ksym __weak; const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak; const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak; diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c index c75c70d6a8eb..b1d0b09c966e 100644 --- a/tools/sched_ext/scx_qmap.bpf.c +++ b/tools/sched_ext/scx_qmap.bpf.c @@ -68,6 +68,18 @@ struct { }, }; +/* + * If enabled, CPU performance target is set according to the queue index + * according to the following table. + */ +static const u32 qidx_to_cpuperf_target[] = { + [0] = SCX_CPUPERF_ONE * 0 / 4, + [1] = SCX_CPUPERF_ONE * 1 / 4, + [2] = SCX_CPUPERF_ONE * 2 / 4, + [3] = SCX_CPUPERF_ONE * 3 / 4, + [4] = SCX_CPUPERF_ONE * 4 / 4, +}; + /* * Per-queue sequence numbers to implement core-sched ordering. * @@ -95,6 +107,8 @@ struct { struct cpu_ctx { u64 dsp_idx; /* dispatch index */ u64 dsp_cnt; /* remaining count */ + u32 avg_weight; + u32 cpuperf_target; }; struct { @@ -107,6 +121,8 @@ struct { /* Statistics */ u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued; u64 nr_core_sched_execed; +u32 cpuperf_min, cpuperf_avg, cpuperf_max; +u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max; s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags) @@ -313,6 +329,29 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev) } } +void BPF_STRUCT_OPS(qmap_tick, struct task_struct *p) +{ + struct cpu_ctx *cpuc; + u32 zero = 0; + int idx; + + if (!(cpuc = bpf_map_lookup_elem(&cpu_ctx_stor, &zero))) { + scx_bpf_error("failed to look up cpu_ctx"); + return; + } + + /* + * Use the running avg of weights to select the target cpuperf level. + * This is a demonstration of the cpuperf feature rather than a + * practical strategy to regulate CPU frequency. + */ + cpuc->avg_weight = cpuc->avg_weight * 3 / 4 + p->scx.weight / 4; + idx = weight_to_idx(cpuc->avg_weight); + cpuc->cpuperf_target = qidx_to_cpuperf_target[idx]; + + scx_bpf_cpuperf_set(scx_bpf_task_cpu(p), cpuc->cpuperf_target); +} + /* * The distance from the head of the queue scaled by the weight of the queue. * The lower the number, the older the task and the higher the priority. @@ -422,8 +461,9 @@ void BPF_STRUCT_OPS(qmap_dump_cpu, struct scx_dump_ctx *dctx, s32 cpu, bool idle if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, cpu))) return; - scx_bpf_dump("QMAP: dsp_idx=%llu dsp_cnt=%llu", - cpuc->dsp_idx, cpuc->dsp_cnt); + scx_bpf_dump("QMAP: dsp_idx=%llu dsp_cnt=%llu avg_weight=%u cpuperf_target=%u", + cpuc->dsp_idx, cpuc->dsp_cnt, cpuc->avg_weight, + cpuc->cpuperf_target); } void BPF_STRUCT_OPS(qmap_dump_task, struct scx_dump_ctx *dctx, struct task_struct *p) @@ -492,11 +532,106 @@ void BPF_STRUCT_OPS(qmap_cpu_offline, s32 cpu) print_cpus(); } +struct monitor_timer { + struct bpf_timer timer; +}; + +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, 1); + __type(key, u32); + __type(value, struct monitor_timer); +} monitor_timer SEC(".maps"); + +/* + * Print out the min, avg and max performance levels of CPUs every second to + * demonstrate the cpuperf interface. + */ +static void monitor_cpuperf(void) +{ + u32 zero = 0, nr_cpu_ids; + u64 cap_sum = 0, cur_sum = 0, cur_min = SCX_CPUPERF_ONE, cur_max = 0; + u64 target_sum = 0, target_min = SCX_CPUPERF_ONE, target_max = 0; + const struct cpumask *online; + int i, nr_online_cpus = 0; + + nr_cpu_ids = scx_bpf_nr_cpu_ids(); + online = scx_bpf_get_online_cpumask(); + + bpf_for(i, 0, nr_cpu_ids) { + struct cpu_ctx *cpuc; + u32 cap, cur; + + if (!bpf_cpumask_test_cpu(i, online)) + continue; + nr_online_cpus++; + + /* collect the capacity and current cpuperf */ + cap = scx_bpf_cpuperf_cap(i); + cur = scx_bpf_cpuperf_cur(i); + + cur_min = cur < cur_min ? cur : cur_min; + cur_max = cur > cur_max ? cur : cur_max; + + /* + * $cur is relative to $cap. Scale it down accordingly so that + * it's in the same scale as other CPUs and $cur_sum/$cap_sum + * makes sense. + */ + cur_sum += cur * cap / SCX_CPUPERF_ONE; + cap_sum += cap; + + if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, i))) { + scx_bpf_error("failed to look up cpu_ctx"); + goto out; + } + + /* collect target */ + cur = cpuc->cpuperf_target; + target_sum += cur; + target_min = cur < target_min ? cur : target_min; + target_max = cur > target_max ? cur : target_max; + } + + cpuperf_min = cur_min; + cpuperf_avg = cur_sum * SCX_CPUPERF_ONE / cap_sum; + cpuperf_max = cur_max; + + cpuperf_target_min = target_min; + cpuperf_target_avg = target_sum / nr_online_cpus; + cpuperf_target_max = target_max; +out: + scx_bpf_put_cpumask(online); +} + +static int monitor_timerfn(void *map, int *key, struct bpf_timer *timer) +{ + monitor_cpuperf(); + + bpf_timer_start(timer, ONE_SEC_IN_NS, 0); + return 0; +} + s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init) { + u32 key = 0; + struct bpf_timer *timer; + s32 ret; + print_cpus(); - return scx_bpf_create_dsq(SHARED_DSQ, -1); + ret = scx_bpf_create_dsq(SHARED_DSQ, -1); + if (ret) + return ret; + + timer = bpf_map_lookup_elem(&monitor_timer, &key); + if (!timer) + return -ESRCH; + + bpf_timer_init(timer, &monitor_timer, CLOCK_MONOTONIC); + bpf_timer_set_callback(timer, monitor_timerfn); + + return bpf_timer_start(timer, ONE_SEC_IN_NS, 0); } void BPF_STRUCT_OPS(qmap_exit, struct scx_exit_info *ei) @@ -509,6 +644,7 @@ SCX_OPS_DEFINE(qmap_ops, .enqueue = (void *)qmap_enqueue, .dequeue = (void *)qmap_dequeue, .dispatch = (void *)qmap_dispatch, + .tick = (void *)qmap_tick, .core_sched_before = (void *)qmap_core_sched_before, .cpu_release = (void *)qmap_cpu_release, .init_task = (void *)qmap_init_task, diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c index bc36ec4f88a7..4d41c0cb1dab 100644 --- a/tools/sched_ext/scx_qmap.c +++ b/tools/sched_ext/scx_qmap.c @@ -116,6 +116,14 @@ int main(int argc, char **argv) nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched, skel->bss->nr_reenqueued, skel->bss->nr_dequeued, skel->bss->nr_core_sched_execed); + if (__COMPAT_has_ksym("scx_bpf_cpuperf_cur")) + printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n", + skel->bss->cpuperf_min, + skel->bss->cpuperf_avg, + skel->bss->cpuperf_max, + skel->bss->cpuperf_target_min, + skel->bss->cpuperf_target_avg, + skel->bss->cpuperf_target_max); fflush(stdout); sleep(1); }