diff mbox series

[2/2] sched_ext: Add cpuperf support

Message ID 20240619031250.2936087-3-tj@kernel.org (mailing list archive)
State Handled Elsewhere, archived
Headers show
Series [1/2] cpufreq_schedutil: Refactor sugov_cpu_is_busy() | expand

Commit Message

Tejun Heo June 19, 2024, 3:12 a.m. UTC
sched_ext currently does not integrate with schedutil. When schedutil is the
governor, frequencies are left unregulated and usually get stuck close to
the highest performance level from running RT tasks.

Add CPU performance monitoring and scaling support by integrating into
schedutil. The following kfuncs are added:

- scx_bpf_cpuperf_cap(): Query the relative performance capacity of
  different CPUs in the system.

- scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
  relative to its max performance.

- scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.

This gives direct control over CPU performance setting to the BPF scheduler.
The only changes on the schedutil side are accounting for the utilization
factor from sched_ext and disabling frequency holding heuristics as it may
not apply well to sched_ext schedulers which may have a lot weaker
connection between tasks and their current / last CPU.

With cpuperf support added, there is no reason to block uclamp. Enable while
at it.

A toy implementation of cpuperf is added to scx_qmap as a demonstration of
the feature.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
---
 kernel/sched/cpufreq_schedutil.c         |  12 +-
 kernel/sched/ext.c                       |  83 ++++++++++++-
 kernel/sched/ext.h                       |   9 ++
 kernel/sched/sched.h                     |   1 +
 tools/sched_ext/include/scx/common.bpf.h |   3 +
 tools/sched_ext/scx_qmap.bpf.c           | 142 ++++++++++++++++++++++-
 tools/sched_ext/scx_qmap.c               |   8 ++
 7 files changed, 252 insertions(+), 6 deletions(-)

Comments

Christian Loehle June 19, 2024, 2:07 p.m. UTC | #1
On 6/19/24 04:12, Tejun Heo wrote:
> sched_ext currently does not integrate with schedutil. When schedutil is the
> governor, frequencies are left unregulated and usually get stuck close to
> the highest performance level from running RT tasks.
> 
> Add CPU performance monitoring and scaling support by integrating into
> schedutil. The following kfuncs are added:
> 
> - scx_bpf_cpuperf_cap(): Query the relative performance capacity of
>   different CPUs in the system.
> 
> - scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
>   relative to its max performance.
> 
> - scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.
> 
> This gives direct control over CPU performance setting to the BPF scheduler.
> The only changes on the schedutil side are accounting for the utilization
> factor from sched_ext and disabling frequency holding heuristics as it may
> not apply well to sched_ext schedulers which may have a lot weaker
> connection between tasks and their current / last CPU.
> 
> With cpuperf support added, there is no reason to block uclamp. Enable while
> at it.
> 
> A toy implementation of cpuperf is added to scx_qmap as a demonstration of
> the feature.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: David Vernet <dvernet@meta.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>  kernel/sched/cpufreq_schedutil.c         |  12 +-
>  kernel/sched/ext.c                       |  83 ++++++++++++-
>  kernel/sched/ext.h                       |   9 ++
>  kernel/sched/sched.h                     |   1 +
>  tools/sched_ext/include/scx/common.bpf.h |   3 +
>  tools/sched_ext/scx_qmap.bpf.c           | 142 ++++++++++++++++++++++-
>  tools/sched_ext/scx_qmap.c               |   8 ++
>  7 files changed, 252 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 972b7dd65af2..12174c0137a5 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -197,7 +197,9 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
>  
>  static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
>  {
> -	unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> +	unsigned long min, max;
> +	unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu) +
> +		scx_cpuperf_target(sg_cpu->cpu);

What does cpu_util_cfs_boost() contain if scx is active?
NIT: reverse xmas

>  
>  	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
>  	util = max(util, boost);
> @@ -330,6 +332,14 @@ static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
>  	unsigned long idle_calls;
>  	bool ret;
>  
> +	/*
> +	 * The heuristics in this function is for the fair class. For SCX, the
> +	 * performance target comes directly from the BPF scheduler. Let's just
> +	 * follow it.
> +	 */
> +	if (scx_switched_all())
> +		return false;
> +
> [SNIP]
Tejun Heo June 19, 2024, 7:19 p.m. UTC | #2
Hello,

On Wed, Jun 19, 2024 at 03:07:51PM +0100, Christian Loehle wrote:
> >  static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> >  {
> > -	unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> > +	unsigned long min, max;
> > +	unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu) +
> > +		scx_cpuperf_target(sg_cpu->cpu);
> 
> What does cpu_util_cfs_boost() contain if scx is active?

Hmm... when SCX is in partial mode, cpu_util_cfs_boost() reports the util
metric from cfs side as usual, but, yeah, when scx_switched_all(), this
could be stale. I'll update.

> NIT: reverse xmas

Ah, let me just make the initial assignment a separate statement. Reverse
xmas is awkward with broken lines.

Thanks.
Hongyan Xia July 2, 2024, 10:23 a.m. UTC | #3
On 19/06/2024 04:12, Tejun Heo wrote:
> sched_ext currently does not integrate with schedutil. When schedutil is the
> governor, frequencies are left unregulated and usually get stuck close to
> the highest performance level from running RT tasks.
> 
> Add CPU performance monitoring and scaling support by integrating into
> schedutil. The following kfuncs are added:
> 
> - scx_bpf_cpuperf_cap(): Query the relative performance capacity of
>    different CPUs in the system.
> 
> - scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
>    relative to its max performance.
> 
> - scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.
> 
> This gives direct control over CPU performance setting to the BPF scheduler.
> The only changes on the schedutil side are accounting for the utilization
> factor from sched_ext and disabling frequency holding heuristics as it may
> not apply well to sched_ext schedulers which may have a lot weaker
> connection between tasks and their current / last CPU.
> 
> With cpuperf support added, there is no reason to block uclamp. Enable while
> at it.
> 
> A toy implementation of cpuperf is added to scx_qmap as a demonstration of
> the feature.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: David Vernet <dvernet@meta.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>   kernel/sched/cpufreq_schedutil.c         |  12 +-
>   kernel/sched/ext.c                       |  83 ++++++++++++-
>   kernel/sched/ext.h                       |   9 ++
>   kernel/sched/sched.h                     |   1 +
>   tools/sched_ext/include/scx/common.bpf.h |   3 +
>   tools/sched_ext/scx_qmap.bpf.c           | 142 ++++++++++++++++++++++-
>   tools/sched_ext/scx_qmap.c               |   8 ++
>   7 files changed, 252 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 972b7dd65af2..12174c0137a5 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -197,7 +197,9 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
>   
>   static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
>   {
> -	unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> +	unsigned long min, max;
> +	unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu) +
> +		scx_cpuperf_target(sg_cpu->cpu);
>   
>   	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
>   	util = max(util, boost);
> @@ -330,6 +332,14 @@ static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
>   	unsigned long idle_calls;
>   	bool ret;
>   
> +	/*
> +	 * The heuristics in this function is for the fair class. For SCX, the
> +	 * performance target comes directly from the BPF scheduler. Let's just
> +	 * follow it.
> +	 */
> +	if (scx_switched_all())
> +		return false;
> +
>   	/* if capped by uclamp_max, always update to be in compliance */
>   	if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
>   		return false;
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index f814e84ceeb3..04fb0eeee5ec 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -16,6 +16,8 @@ enum scx_consts {
>   	SCX_EXIT_BT_LEN			= 64,
>   	SCX_EXIT_MSG_LEN		= 1024,
>   	SCX_EXIT_DUMP_DFL_LEN		= 32768,
> +
> +	SCX_CPUPERF_ONE			= SCHED_CAPACITY_SCALE,
>   };
>   
>   enum scx_exit_kind {
> @@ -3520,7 +3522,7 @@ DEFINE_SCHED_CLASS(ext) = {
>   	.update_curr		= update_curr_scx,
>   
>   #ifdef CONFIG_UCLAMP_TASK
> -	.uclamp_enabled		= 0,
> +	.uclamp_enabled		= 1,
>   #endif
>   };
>   

Hi. I know this is a bit late, but the implication of this change here 
can be quite interesting.

With this patch but without switching this knob from 0 to 1, this series 
gives me the perfect opportunity to implement a custom uclamp within 
sched_ext on top of the cpufreq support added. I think this would be 
what some vendors looking at sched_ext would also want. But, if 
.uclamp_enabled == 1, then the mainline uclamp implementation is in 
effect regardless of what ext scheduler is loaded. In fact, 
uclamp_{inc,dec}() are before calling the {enqueue,dequeue}_task() so 
now there's no easy way to circumvent it.

What would be really nice is to have cpufreq support in sched_ext but 
not force uclamp_enabled. But, I also think there will be people who are 
happy with the current uclamp implementation and want to just reuse it. 
The best thing is to let the loaded scheduler decide, somehow, which I 
don't know if there's an easy way to do this yet.

> [...]
Tejun Heo July 2, 2024, 4:37 p.m. UTC | #4
Hello, Hongyan.

On Tue, Jul 02, 2024 at 11:23:58AM +0100, Hongyan Xia wrote:
> What would be really nice is to have cpufreq support in sched_ext but not
> force uclamp_enabled. But, I also think there will be people who are happy
> with the current uclamp implementation and want to just reuse it. The best
> thing is to let the loaded scheduler decide, somehow, which I don't know if
> there's an easy way to do this yet.

I don't know much about uclamp but at least from sched_ext side, it's
trivial add an ops flag for it and because we know that no tasks are on the
ext class before BPF scheduler is loaded, as long as we switch the
uclamp_enabled value while the BPF scheduler is not loaded, the uclamp
buckets should stay balanced. AFAICS, the only core change we need to make
is mooving the uclamp_enabled bool outside sched_class so that it can be
changed runtime. Is that the case or am I missing something?

Thanks.
Hongyan Xia July 2, 2024, 5:12 p.m. UTC | #5
On 02/07/2024 17:37, Tejun Heo wrote:
> Hello, Hongyan.
> 
> On Tue, Jul 02, 2024 at 11:23:58AM +0100, Hongyan Xia wrote:
>> What would be really nice is to have cpufreq support in sched_ext but not
>> force uclamp_enabled. But, I also think there will be people who are happy
>> with the current uclamp implementation and want to just reuse it. The best
>> thing is to let the loaded scheduler decide, somehow, which I don't know if
>> there's an easy way to do this yet.
> 
> I don't know much about uclamp but at least from sched_ext side, it's
> trivial add an ops flag for it and because we know that no tasks are on the
> ext class before BPF scheduler is loaded, as long as we switch the
> uclamp_enabled value while the BPF scheduler is not loaded, the uclamp
> buckets should stay balanced. AFAICS, the only core change we need to make
> is mooving the uclamp_enabled bool outside sched_class so that it can be
> changed runtime. Is that the case or am I missing something?
> 

Pretty much. Just to clarify what I meant, it would be fantastic if for 
ext, sched_class->uclamp_enabled is decided the moment we load the 
custom scheduler, not globally enabled all the time for all ext 
schedulers, in case the custom scheduler wants to ignore uclamp or has 
its own uclamp implementation. During ext_ops->init(), it would be great 
if the loaded scheduler could decide whether its 
sched_class->uclamp_enabled should be enabled.

However, sched_class->uclamp_enabled is just a normal struct variable, 
so I cannot immediately see a clean way to let the loaded scheduler 
program this field. We might be able to expose a function from the 
kernel side to write sched_class->uclamp_enabled during ext_ops->init(), 
although that looks a bit messy.
Tejun Heo July 2, 2024, 5:56 p.m. UTC | #6
Hello,

So, maybe something like this. It's not the prettiest but avoids adding
indirect calls to fair and rt while allowing sched_ext to report what the
BPF scheduler wants. Only compile tested. Would something like this work for
the use cases you have on mind?

Thanks.

Index: work/kernel/sched/core.c
===================================================================
--- work.orig/kernel/sched/core.c
+++ work/kernel/sched/core.c
@@ -1671,6 +1671,20 @@ static inline void uclamp_rq_dec_id(stru
 	}
 }
 
+bool sched_uclamp_enabled(void)
+{
+	return true;
+}
+
+static bool class_supports_uclamp(const struct sched_class *class)
+{
+	if (likely(class->uclamp_enabled == sched_uclamp_enabled))
+		return true;
+	if (!class->uclamp_enabled)
+		return false;
+	return class->uclamp_enabled();
+}
+
 static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
 {
 	enum uclamp_id clamp_id;
@@ -1684,7 +1698,7 @@ static inline void uclamp_rq_inc(struct
 	if (!static_branch_unlikely(&sched_uclamp_used))
 		return;
 
-	if (unlikely(!p->sched_class->uclamp_enabled))
+	if (class_supports_uclamp(p->sched_class))
 		return;
 
 	for_each_clamp_id(clamp_id)
@@ -1708,7 +1722,7 @@ static inline void uclamp_rq_dec(struct
 	if (!static_branch_unlikely(&sched_uclamp_used))
 		return;
 
-	if (unlikely(!p->sched_class->uclamp_enabled))
+	if (class_supports_uclamp(p->sched_class))
 		return;
 
 	for_each_clamp_id(clamp_id)
Index: work/kernel/sched/ext.c
===================================================================
--- work.orig/kernel/sched/ext.c
+++ work/kernel/sched/ext.c
@@ -116,10 +116,17 @@ enum scx_ops_flags {
 	 */
 	SCX_OPS_SWITCH_PARTIAL	= 1LLU << 3,
 
+	/*
+	 * Disable built-in uclamp support. Can be useful when the BPF scheduler
+	 * wants to implement custom uclamp support.
+	 */
+	SCX_OPS_DISABLE_UCLAMP	= 1LLU << 4,
+
 	SCX_OPS_ALL_FLAGS	= SCX_OPS_KEEP_BUILTIN_IDLE |
 				  SCX_OPS_ENQ_LAST |
 				  SCX_OPS_ENQ_EXITING |
-				  SCX_OPS_SWITCH_PARTIAL,
+				  SCX_OPS_SWITCH_PARTIAL |
+				  SCX_OPS_DISABLE_UCLAMP,
 };
 
 /* argument container for ops.init_task() */
@@ -3437,6 +3444,13 @@ static void switched_from_scx(struct rq
 static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
 static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
 
+#ifdef CONFIG_UCLAMP_TASK
+static bool uclamp_enabled_scx(void)
+{
+	return !(scx_ops.flags & SCX_OPS_DISABLE_UCLAMP);
+}
+#endif
+
 int scx_check_setscheduler(struct task_struct *p, int policy)
 {
 	lockdep_assert_rq_held(task_rq(p));
@@ -3522,7 +3536,7 @@ DEFINE_SCHED_CLASS(ext) = {
 	.update_curr		= update_curr_scx,
 
 #ifdef CONFIG_UCLAMP_TASK
-	.uclamp_enabled		= 1,
+	.uclamp_enabled		= uclamp_enabled_scx,
 #endif
 };
 
Index: work/kernel/sched/fair.c
===================================================================
--- work.orig/kernel/sched/fair.c
+++ work/kernel/sched/fair.c
@@ -13228,9 +13228,7 @@ DEFINE_SCHED_CLASS(fair) = {
 	.task_is_throttled	= task_is_throttled_fair,
 #endif
 
-#ifdef CONFIG_UCLAMP_TASK
-	.uclamp_enabled		= 1,
-#endif
+	SCHED_CLASS_UCLAMP_ENABLED
 };
 
 #ifdef CONFIG_SCHED_DEBUG
Index: work/kernel/sched/rt.c
===================================================================
--- work.orig/kernel/sched/rt.c
+++ work/kernel/sched/rt.c
@@ -2681,9 +2681,7 @@ DEFINE_SCHED_CLASS(rt) = {
 	.task_is_throttled	= task_is_throttled_rt,
 #endif
 
-#ifdef CONFIG_UCLAMP_TASK
-	.uclamp_enabled		= 1,
-#endif
+	SCHED_CLASS_UCLAMP_ENABLED
 };
 
 #ifdef CONFIG_RT_GROUP_SCHED
Index: work/kernel/sched/sched.h
===================================================================
--- work.orig/kernel/sched/sched.h
+++ work/kernel/sched/sched.h
@@ -2339,11 +2339,6 @@ struct affinity_context {
 extern s64 update_curr_common(struct rq *rq);
 
 struct sched_class {
-
-#ifdef CONFIG_UCLAMP_TASK
-	int uclamp_enabled;
-#endif
-
 	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
 	void (*yield_task)   (struct rq *rq);
@@ -2405,8 +2400,21 @@ struct sched_class {
 #ifdef CONFIG_SCHED_CORE
 	int (*task_is_throttled)(struct task_struct *p, int cpu);
 #endif
+
+#ifdef CONFIG_UCLAMP_TASK
+	bool (*uclamp_enabled)(void);
+#endif
 };
 
+#ifdef CONFIG_UCLAMP_TASK
+bool sched_uclamp_enabled(void);
+
+#define SCHED_CLASS_UCLAMP_ENABLED	\
+	.uclamp_enabled = sched_uclamp_enabled,
+#else
+#define SCHED_CLASS_UCLAMP_ENABLED
+#endif
+
 static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 {
 	WARN_ON_ONCE(rq->curr != prev);
Hongyan Xia July 2, 2024, 8:41 p.m. UTC | #7
On 02/07/2024 18:56, Tejun Heo wrote:
> Hello,
> 
> So, maybe something like this. It's not the prettiest but avoids adding
> indirect calls to fair and rt while allowing sched_ext to report what the
> BPF scheduler wants. Only compile tested. Would something like this work for
> the use cases you have on mind?
> 
> Thanks.
> 
> Index: work/kernel/sched/core.c
> ===================================================================
> --- work.orig/kernel/sched/core.c
> +++ work/kernel/sched/core.c
> @@ -1671,6 +1671,20 @@ static inline void uclamp_rq_dec_id(stru
>   	}
>   }
>   
> +bool sched_uclamp_enabled(void)
> +{
> +	return true;
> +}
> +
> +static bool class_supports_uclamp(const struct sched_class *class)
> +{
> +	if (likely(class->uclamp_enabled == sched_uclamp_enabled))
> +		return true;
> +	if (!class->uclamp_enabled)
> +		return false;
> +	return class->uclamp_enabled();
> +}
> +
>   static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
>   {
>   	enum uclamp_id clamp_id;
> @@ -1684,7 +1698,7 @@ static inline void uclamp_rq_inc(struct
>   	if (!static_branch_unlikely(&sched_uclamp_used))
>   		return;
>   
> -	if (unlikely(!p->sched_class->uclamp_enabled))
> +	if (class_supports_uclamp(p->sched_class))
>   		return;
>   
>   	for_each_clamp_id(clamp_id)
> @@ -1708,7 +1722,7 @@ static inline void uclamp_rq_dec(struct
>   	if (!static_branch_unlikely(&sched_uclamp_used))
>   		return;
>   
> -	if (unlikely(!p->sched_class->uclamp_enabled))
> +	if (class_supports_uclamp(p->sched_class))
>   		return;
>   
>   	for_each_clamp_id(clamp_id)
> Index: work/kernel/sched/ext.c
> ===================================================================
> --- work.orig/kernel/sched/ext.c
> +++ work/kernel/sched/ext.c
> @@ -116,10 +116,17 @@ enum scx_ops_flags {
>   	 */
>   	SCX_OPS_SWITCH_PARTIAL	= 1LLU << 3,
>   
> +	/*
> +	 * Disable built-in uclamp support. Can be useful when the BPF scheduler
> +	 * wants to implement custom uclamp support.
> +	 */
> +	SCX_OPS_DISABLE_UCLAMP	= 1LLU << 4,
> +
>   	SCX_OPS_ALL_FLAGS	= SCX_OPS_KEEP_BUILTIN_IDLE |
>   				  SCX_OPS_ENQ_LAST |
>   				  SCX_OPS_ENQ_EXITING |
> -				  SCX_OPS_SWITCH_PARTIAL,
> +				  SCX_OPS_SWITCH_PARTIAL |
> +				  SCX_OPS_DISABLE_UCLAMP,
>   };
>   
>   /* argument container for ops.init_task() */
> @@ -3437,6 +3444,13 @@ static void switched_from_scx(struct rq
>   static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p,int wake_flags) {}
>   static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
>   
> +#ifdef CONFIG_UCLAMP_TASK
> +static bool uclamp_enabled_scx(void)
> +{
> +	return !(scx_ops.flags & SCX_OPS_DISABLE_UCLAMP);
> +}
> +#endif
> +
>   int scx_check_setscheduler(struct task_struct *p, int policy)
>   {
>   	lockdep_assert_rq_held(task_rq(p));
> @@ -3522,7 +3536,7 @@ DEFINE_SCHED_CLASS(ext) = {
>   	.update_curr		= update_curr_scx,
>   
>   #ifdef CONFIG_UCLAMP_TASK
> -	.uclamp_enabled		= 1,
> +	.uclamp_enabled		= uclamp_enabled_scx,
>   #endif
>   };
>   
> Index: work/kernel/sched/fair.c
> ===================================================================
> --- work.orig/kernel/sched/fair.c
> +++ work/kernel/sched/fair.c
> @@ -13228,9 +13228,7 @@ DEFINE_SCHED_CLASS(fair) = {
>   	.task_is_throttled	= task_is_throttled_fair,
>   #endif
>   
> -#ifdef CONFIG_UCLAMP_TASK
> -	.uclamp_enabled		= 1,
> -#endif
> +	SCHED_CLASS_UCLAMP_ENABLED
>   };
>   
>   #ifdef CONFIG_SCHED_DEBUG
> Index: work/kernel/sched/rt.c
> ===================================================================
> --- work.orig/kernel/sched/rt.c
> +++ work/kernel/sched/rt.c
> @@ -2681,9 +2681,7 @@ DEFINE_SCHED_CLASS(rt) = {
>   	.task_is_throttled	= task_is_throttled_rt,
>   #endif
>   
> -#ifdef CONFIG_UCLAMP_TASK
> -	.uclamp_enabled		= 1,
> -#endif
> +	SCHED_CLASS_UCLAMP_ENABLED
>   };
>   
>   #ifdef CONFIG_RT_GROUP_SCHED
> Index: work/kernel/sched/sched.h
> ===================================================================
> --- work.orig/kernel/sched/sched.h
> +++ work/kernel/sched/sched.h
> @@ -2339,11 +2339,6 @@ struct affinity_context {
>   extern s64 update_curr_common(struct rq *rq);
>   
>   struct sched_class {
> -
> -#ifdef CONFIG_UCLAMP_TASK
> -	int uclamp_enabled;
> -#endif
> -
>   	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
>   	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
>   	void (*yield_task)   (struct rq *rq);
> @@ -2405,8 +2400,21 @@ struct sched_class {
>   #ifdef CONFIG_SCHED_CORE
>   	int (*task_is_throttled)(struct task_struct *p, int cpu);
>   #endif
> +
> +#ifdef CONFIG_UCLAMP_TASK
> +	bool (*uclamp_enabled)(void);
> +#endif
>   };
>   
> +#ifdef CONFIG_UCLAMP_TASK
> +bool sched_uclamp_enabled(void);
> +
> +#define SCHED_CLASS_UCLAMP_ENABLED	\
> +	.uclamp_enabled = sched_uclamp_enabled,
> +#else
> +#define SCHED_CLASS_UCLAMP_ENABLED
> +#endif
> +
>   static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
>   {
>   	WARN_ON_ONCE(rq->curr != prev);

Looks good to me!

Actually, if we are okay with changing the sched_class struct and 
touching the code of other classes, I wonder if a cleaner solution is 
just to completely remove sched_class->uclamp_enabled and let each class 
decide what to do in enqueue and dequeue, so instead of

	uclamp_inc/dec();
	p->sched_class->enqueue/dequeue_task();

we can just

	p->sched_class->enqueue/dequeue_task();
		do_uclamp_inside_each_class();

and we export uclamp_inc/dec() functions from core.c to RT, fair and 
ext. For RT and fair, just

	enqueue/dequeue_task_fair/rt();
		uclamp_inc/dec();

For ext, maybe expose bpf_uclamp_inc/dec() to the custom scheduler. If a 
scheduler wants to reuse the current uclamp implementation, just call 
these. If not, skip them and implement its own.
Tejun Heo July 2, 2024, 9:12 p.m. UTC | #8
Hello,

On Tue, Jul 02, 2024 at 09:41:30PM +0100, Hongyan Xia wrote:
...
> Actually, if we are okay with changing the sched_class struct and touching
> the code of other classes, I wonder if a cleaner solution is just to
> completely remove sched_class->uclamp_enabled and let each class decide what
> to do in enqueue and dequeue, so instead of
> 
> 	uclamp_inc/dec();
> 	p->sched_class->enqueue/dequeue_task();
> 
> we can just
> 
> 	p->sched_class->enqueue/dequeue_task();
> 		do_uclamp_inside_each_class();
> 
> and we export uclamp_inc/dec() functions from core.c to RT, fair and ext.
> For RT and fair, just
> 
> 	enqueue/dequeue_task_fair/rt();
> 		uclamp_inc/dec();
>
> For ext, maybe expose bpf_uclamp_inc/dec() to the custom scheduler. If a
> scheduler wants to reuse the current uclamp implementation, just call these.
> If not, skip them and implement its own.

That does sound a lot better. Mind writing up a patchset?

Thanks.
Qais Yousef July 24, 2024, 11:45 p.m. UTC | #9
On 06/18/24 17:12, Tejun Heo wrote:
> sched_ext currently does not integrate with schedutil. When schedutil is the
> governor, frequencies are left unregulated and usually get stuck close to
> the highest performance level from running RT tasks.

Have you tried to investigate why is that? By default RT run at max frequency.
Only way to prevent them from doing that is by using uclamp

	https://kernel.org/doc/html/latest/scheduler/sched-util-clamp.html#sched-util-clamp-min-rt-default

If that's not the cause, then it's likely something else is broken.

> 
> Add CPU performance monitoring and scaling support by integrating into
> schedutil. The following kfuncs are added:
> 
> - scx_bpf_cpuperf_cap(): Query the relative performance capacity of
>   different CPUs in the system.
> 
> - scx_bpf_cpuperf_cur(): Query the current performance level of a CPU
>   relative to its max performance.
> 
> - scx_bpf_cpuperf_set(): Set the current target performance level of a CPU.

What is exactly the problem you're seeing? You shouldn't need to set
performance directly. Are you trying to fix a problem, or add a new feature?

> 
> This gives direct control over CPU performance setting to the BPF scheduler.

Why would we need to do that?  schedutil is supposed to operate in utilization
signal. Overriding it with custom unknown changes makes it all random governor
based on what's current bpf sched_ext is loaded? This make bug reports and
debugging problems a lot harder.

I do hope by the way that loading external scheduler does cause the kernel to
be tainted. With these random changes, it's hard to know if it is a problem in
the kernel or with external out of tree entity. Out of tree modules taint the
kernel, so should loading sched_ext.

It should not cause spurious reports, nor prevent us from changing the code
without worrying about breaking out of tree code.

> The only changes on the schedutil side are accounting for the utilization
> factor from sched_ext and disabling frequency holding heuristics as it may
> not apply well to sched_ext schedulers which may have a lot weaker
> connection between tasks and their current / last CPU.
> 
> With cpuperf support added, there is no reason to block uclamp. Enable while
> at it.
> 
> A toy implementation of cpuperf is added to scx_qmap as a demonstration of
> the feature.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: David Vernet <dvernet@meta.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Viresh Kumar <viresh.kumar@linaro.org>
> ---
>  kernel/sched/cpufreq_schedutil.c         |  12 +-
>  kernel/sched/ext.c                       |  83 ++++++++++++-
>  kernel/sched/ext.h                       |   9 ++
>  kernel/sched/sched.h                     |   1 +
>  tools/sched_ext/include/scx/common.bpf.h |   3 +
>  tools/sched_ext/scx_qmap.bpf.c           | 142 ++++++++++++++++++++++-
>  tools/sched_ext/scx_qmap.c               |   8 ++
>  7 files changed, 252 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 972b7dd65af2..12174c0137a5 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -197,7 +197,9 @@ unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
>  
>  static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
>  {
> -	unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
> +	unsigned long min, max;
> +	unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu) +
> +		scx_cpuperf_target(sg_cpu->cpu);
>  
>  	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
>  	util = max(util, boost);
> @@ -330,6 +332,14 @@ static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
>  	unsigned long idle_calls;
>  	bool ret;
>  
> +	/*
> +	 * The heuristics in this function is for the fair class. For SCX, the
> +	 * performance target comes directly from the BPF scheduler. Let's just
> +	 * follow it.
> +	 */
> +	if (scx_switched_all())
> +		return false;

Why do you need to totally override? What problems did you find in current util
value and what have you done to try to fix it first rather than override it
completely?

> +
>  	/* if capped by uclamp_max, always update to be in compliance */
>  	if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
>  		return false;
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index f814e84ceeb3..04fb0eeee5ec 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -16,6 +16,8 @@ enum scx_consts {
>  	SCX_EXIT_BT_LEN			= 64,
>  	SCX_EXIT_MSG_LEN		= 1024,
>  	SCX_EXIT_DUMP_DFL_LEN		= 32768,
> +
> +	SCX_CPUPERF_ONE			= SCHED_CAPACITY_SCALE,
>  };
>  
>  enum scx_exit_kind {
> @@ -3520,7 +3522,7 @@ DEFINE_SCHED_CLASS(ext) = {
>  	.update_curr		= update_curr_scx,
>  
>  #ifdef CONFIG_UCLAMP_TASK
> -	.uclamp_enabled		= 0,
> +	.uclamp_enabled		= 1,
>  #endif
>  };
>  
> @@ -4393,7 +4395,7 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
>  	struct scx_task_iter sti;
>  	struct task_struct *p;
>  	unsigned long timeout;
> -	int i, ret;
> +	int i, cpu, ret;
>  
>  	mutex_lock(&scx_ops_enable_mutex);
>  
> @@ -4442,6 +4444,9 @@ static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
>  
>  	atomic_long_set(&scx_nr_rejected, 0);
>  
> +	for_each_possible_cpu(cpu)
> +		cpu_rq(cpu)->scx.cpuperf_target = SCX_CPUPERF_ONE;
> +
>  	/*
>  	 * Keep CPUs stable during enable so that the BPF scheduler can track
>  	 * online CPUs by watching ->on/offline_cpu() after ->init().
> @@ -5835,6 +5840,77 @@ __bpf_kfunc void scx_bpf_dump_bstr(char *fmt, unsigned long long *data,
>  		ops_dump_flush();
>  }
>  
> +/**
> + * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU
> + * @cpu: CPU of interest
> + *
> + * Return the maximum relative capacity of @cpu in relation to the most
> + * performant CPU in the system. The return value is in the range [1,
> + * %SCX_CPUPERF_ONE]. See scx_bpf_cpuperf_cur().
> + */
> +__bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu)
> +{
> +	if (ops_cpu_valid(cpu, NULL))
> +		return arch_scale_cpu_capacity(cpu);
> +	else
> +		return SCX_CPUPERF_ONE;
> +}

Hmm. This is tricky. It looks fine, but I worry about changing how we want to
handle capacities in the future and then being tied down forever with out of
tree sched_ext not being able to load.

How are we going to protect against such potential changes? Just make it a NOP?

A bit hypothetical but so far these are considered internal scheduler details
that could change anytime with no consequence. With this attaching to this info
changing them will become a lot harder as there's external dependencies that
will fail to load or work properly. And what is the regression rule in this
case?

You should make all functions return an error to future proof them against
suddenly disappearing.

> +
> +/**
> + * scx_bpf_cpuperf_cur - Query the current relative performance of a CPU
> + * @cpu: CPU of interest
> + *
> + * Return the current relative performance of @cpu in relation to its maximum.
> + * The return value is in the range [1, %SCX_CPUPERF_ONE].
> + *
> + * The current performance level of a CPU in relation to the maximum performance
> + * available in the system can be calculated as follows:
> + *
> + *   scx_bpf_cpuperf_cap() * scx_bpf_cpuperf_cur() / %SCX_CPUPERF_ONE
> + *
> + * The result is in the range [1, %SCX_CPUPERF_ONE].
> + */
> +__bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu)
> +{
> +	if (ops_cpu_valid(cpu, NULL))
> +		return arch_scale_freq_capacity(cpu);
> +	else
> +		return SCX_CPUPERF_ONE;
> +}
> +
> +/**
> + * scx_bpf_cpuperf_set - Set the relative performance target of a CPU
> + * @cpu: CPU of interest
> + * @perf: target performance level [0, %SCX_CPUPERF_ONE]
> + * @flags: %SCX_CPUPERF_* flags
> + *
> + * Set the target performance level of @cpu to @perf. @perf is in linear
> + * relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the
> + * schedutil cpufreq governor chooses the target frequency.
> + *
> + * The actual performance level chosen, CPU grouping, and the overhead and
> + * latency of the operations are dependent on the hardware and cpufreq driver in
> + * use. Consult hardware and cpufreq documentation for more information. The
> + * current performance level can be monitored using scx_bpf_cpuperf_cur().
> + */
> +__bpf_kfunc void scx_bpf_cpuperf_set(u32 cpu, u32 perf)
> +{
> +	if (unlikely(perf > SCX_CPUPERF_ONE)) {
> +		scx_ops_error("Invalid cpuperf target %u for CPU %d", perf, cpu);
> +		return;
> +	}
> +
> +	if (ops_cpu_valid(cpu, NULL)) {
> +		struct rq *rq = cpu_rq(cpu);
> +
> +		rq->scx.cpuperf_target = perf;
> +
> +		rcu_read_lock_sched_notrace();
> +		cpufreq_update_util(cpu_rq(cpu), 0);
> +		rcu_read_unlock_sched_notrace();
> +	}
> +}

Is the problem that you break how util signal works in sched_ext? Or you want
the fine control? We expect user application to use uclamp to set their perf
requirement. And sched_ext should not break util signal, no? If it does and
there's a good reason for it, then it is not compatible with schedutil, as the
name indicates it operates on util signal as defined in PELT.

You can always use min_freq/max_freq in sysfs to force min and max frequencies
without hacking the governor. I don't advise it though and I'd recommend trying
to be compatible with schedutil as-is rather than modify it. Consistency is
a key.


Thanks

--
Qais Yousef
Tejun Heo July 31, 2024, 1:05 a.m. UTC | #10
Hello, Qais.

On Thu, Jul 25, 2024 at 12:45:27AM +0100, Qais Yousef wrote:
> On 06/18/24 17:12, Tejun Heo wrote:
> > sched_ext currently does not integrate with schedutil. When schedutil is the
> > governor, frequencies are left unregulated and usually get stuck close to
> > the highest performance level from running RT tasks.
> 
> Have you tried to investigate why is that? By default RT run at max frequency.
> Only way to prevent them from doing that is by using uclamp
> 
> 	https://kernel.org/doc/html/latest/scheduler/sched-util-clamp.html#sched-util-clamp-min-rt-default
> 
> If that's not the cause, then it's likely something else is broken.

Nothing is necessarily broken. SCX just wasn't providing any signal to
schedutil before this patch, so schedutil ends up just going by with the
occasional signals from RT.

...
> What is exactly the problem you're seeing? You shouldn't need to set
> performance directly. Are you trying to fix a problem, or add a new feature?

I'm having a hard time following the question. When active, the BPF
scheduler is the only one who can tell how much each CPU is being used, and
the straightforward to integrate with schedutil is by indicating what the
current CPU utilization level is which is the signal that schedutil takes
from each scheduler class.

> > This gives direct control over CPU performance setting to the BPF scheduler.
> 
> Why would we need to do that?  schedutil is supposed to operate in utilization

Because nobody else knows? No one *can* know. Schedutil is driven by the
utilization signal from the scheduler. Here, the scheduler is implemented in
BPF. Nothing else knows how much workload each CPU is going to get. Note
that the kernel side does not have any meaningful information re. task <->
CPU relationship. That can change on every dispatch. Only the BPF scheduler
itself knows how the CPUs are going to be used.

> signal. Overriding it with custom unknown changes makes it all random governor
> based on what's current bpf sched_ext is loaded? This make bug reports and
> debugging problems a lot harder.

If schedutil is misbehaving while an SCX scheduler is loaded, it's the BPF
scheduler's fault. There isn't much else to it.

> I do hope by the way that loading external scheduler does cause the kernel to
> be tainted. With these random changes, it's hard to know if it is a problem in
> the kernel or with external out of tree entity. Out of tree modules taint the
> kernel, so should loading sched_ext.

Out of tree modules can cause corruptions which can have lasting impacts on
the system even after the module is unloaded. That's why we keep the
persistent taint flags - to remember that the current misbehavior has a
reasonable chance of being impacted by what happened to the system
beforehand. Kernel bugs aside, SCX schedulers shouldn't leave persistent
impacts on the system. In those cases, we usually mark the dumps which SCX
already does.

> It should not cause spurious reports, nor prevent us from changing the code
> without worrying about breaking out of tree code.

Oh yeah, we can and will make compatibility breaking changes. Not
willy-nilly, hopefully.

...
> > +	/*
> > +	 * The heuristics in this function is for the fair class. For SCX, the
> > +	 * performance target comes directly from the BPF scheduler. Let's just
> > +	 * follow it.
> > +	 */
> > +	if (scx_switched_all())
> > +		return false;
> 
> Why do you need to totally override? What problems did you find in current util
> value and what have you done to try to fix it first rather than override it
> completely?

Because it's way cleaner this way. Otherwise, we need to keep calling into
update_blocked_averages() so that the fair class's util metrics can decay
(it often doesn't decay completely due to math inaccuracies but Vincent was
looking at it). When switched_all(), the fair class is not used at all and
keeping calling it to decay the utility metrics to zero seems kinda silly.

...
> > +/**
> > + * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU
> > + * @cpu: CPU of interest
> > + *
> > + * Return the maximum relative capacity of @cpu in relation to the most
> > + * performant CPU in the system. The return value is in the range [1,
> > + * %SCX_CPUPERF_ONE]. See scx_bpf_cpuperf_cur().
> > + */
> > +__bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu)
> > +{
> > +	if (ops_cpu_valid(cpu, NULL))
> > +		return arch_scale_cpu_capacity(cpu);
> > +	else
> > +		return SCX_CPUPERF_ONE;
> > +}
> 
> Hmm. This is tricky. It looks fine, but I worry about changing how we want to
> handle capacities in the future and then being tied down forever with out of
> tree sched_ext not being able to load.
> 
> How are we going to protect against such potential changes? Just make it a NOP?

So, if it's a necessary change, we just break the API and update the
out-of-tree schedulers accordingly. With BPF's CO-RE and supporting
features, we can handle quite a bit of backward compatibility - ie. it's
cumbersome but usually possible to provide BPF-side helpers that allow
updated BPF schedulers to be loaded in both old and new kernels, so it
usually isn't *that* painful.

> A bit hypothetical but so far these are considered internal scheduler details
> that could change anytime with no consequence. With this attaching to this info
> changing them will become a lot harder as there's external dependencies that
> will fail to load or work properly. And what is the regression rule in this
> case?

This is not different from any other BPF hooks and has been discussed many
times. As I wrote before, we don't want to change for no reason but we
definitely can change things if necessary.

> You should make all functions return an error to future proof them against
> suddenly disappearing.

Really, no need to be that draconian.

...
> > +__bpf_kfunc void scx_bpf_cpuperf_set(u32 cpu, u32 perf)
> > +{
> > +	if (unlikely(perf > SCX_CPUPERF_ONE)) {
> > +		scx_ops_error("Invalid cpuperf target %u for CPU %d", perf, cpu);
> > +		return;
> > +	}
> > +
> > +	if (ops_cpu_valid(cpu, NULL)) {
> > +		struct rq *rq = cpu_rq(cpu);
> > +
> > +		rq->scx.cpuperf_target = perf;
> > +
> > +		rcu_read_lock_sched_notrace();
> > +		cpufreq_update_util(cpu_rq(cpu), 0);
> > +		rcu_read_unlock_sched_notrace();
> > +	}
> > +}
> 
> Is the problem that you break how util signal works in sched_ext? Or you want
> the fine control? We expect user application to use uclamp to set their perf
> requirement. And sched_ext should not break util signal, no? If it does and
> there's a good reason for it, then it is not compatible with schedutil, as the
> name indicates it operates on util signal as defined in PELT.
> 
> You can always use min_freq/max_freq in sysfs to force min and max frequencies
> without hacking the governor. I don't advise it though and I'd recommend trying
> to be compatible with schedutil as-is rather than modify it. Consistency is
> a key.

It's not hacking the governor and uclamp should work the same way on top.
The BPF scheduler is the only thing that can tell how and how much a given
CPU is going to be used. There is no way for the kernel to project per-CPU
utilization without asking the BPF scheduler.

Thanks.
diff mbox series

Patch

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 972b7dd65af2..12174c0137a5 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -197,7 +197,9 @@  unsigned long sugov_effective_cpu_perf(int cpu, unsigned long actual,
 
 static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
 {
-	unsigned long min, max, util = cpu_util_cfs_boost(sg_cpu->cpu);
+	unsigned long min, max;
+	unsigned long util = cpu_util_cfs_boost(sg_cpu->cpu) +
+		scx_cpuperf_target(sg_cpu->cpu);
 
 	util = effective_cpu_util(sg_cpu->cpu, util, &min, &max);
 	util = max(util, boost);
@@ -330,6 +332,14 @@  static bool sugov_hold_freq(struct sugov_cpu *sg_cpu)
 	unsigned long idle_calls;
 	bool ret;
 
+	/*
+	 * The heuristics in this function is for the fair class. For SCX, the
+	 * performance target comes directly from the BPF scheduler. Let's just
+	 * follow it.
+	 */
+	if (scx_switched_all())
+		return false;
+
 	/* if capped by uclamp_max, always update to be in compliance */
 	if (uclamp_rq_is_capped(cpu_rq(sg_cpu->cpu)))
 		return false;
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f814e84ceeb3..04fb0eeee5ec 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -16,6 +16,8 @@  enum scx_consts {
 	SCX_EXIT_BT_LEN			= 64,
 	SCX_EXIT_MSG_LEN		= 1024,
 	SCX_EXIT_DUMP_DFL_LEN		= 32768,
+
+	SCX_CPUPERF_ONE			= SCHED_CAPACITY_SCALE,
 };
 
 enum scx_exit_kind {
@@ -3520,7 +3522,7 @@  DEFINE_SCHED_CLASS(ext) = {
 	.update_curr		= update_curr_scx,
 
 #ifdef CONFIG_UCLAMP_TASK
-	.uclamp_enabled		= 0,
+	.uclamp_enabled		= 1,
 #endif
 };
 
@@ -4393,7 +4395,7 @@  static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
 	struct scx_task_iter sti;
 	struct task_struct *p;
 	unsigned long timeout;
-	int i, ret;
+	int i, cpu, ret;
 
 	mutex_lock(&scx_ops_enable_mutex);
 
@@ -4442,6 +4444,9 @@  static int scx_ops_enable(struct sched_ext_ops *ops, struct bpf_link *link)
 
 	atomic_long_set(&scx_nr_rejected, 0);
 
+	for_each_possible_cpu(cpu)
+		cpu_rq(cpu)->scx.cpuperf_target = SCX_CPUPERF_ONE;
+
 	/*
 	 * Keep CPUs stable during enable so that the BPF scheduler can track
 	 * online CPUs by watching ->on/offline_cpu() after ->init().
@@ -5835,6 +5840,77 @@  __bpf_kfunc void scx_bpf_dump_bstr(char *fmt, unsigned long long *data,
 		ops_dump_flush();
 }
 
+/**
+ * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU
+ * @cpu: CPU of interest
+ *
+ * Return the maximum relative capacity of @cpu in relation to the most
+ * performant CPU in the system. The return value is in the range [1,
+ * %SCX_CPUPERF_ONE]. See scx_bpf_cpuperf_cur().
+ */
+__bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu)
+{
+	if (ops_cpu_valid(cpu, NULL))
+		return arch_scale_cpu_capacity(cpu);
+	else
+		return SCX_CPUPERF_ONE;
+}
+
+/**
+ * scx_bpf_cpuperf_cur - Query the current relative performance of a CPU
+ * @cpu: CPU of interest
+ *
+ * Return the current relative performance of @cpu in relation to its maximum.
+ * The return value is in the range [1, %SCX_CPUPERF_ONE].
+ *
+ * The current performance level of a CPU in relation to the maximum performance
+ * available in the system can be calculated as follows:
+ *
+ *   scx_bpf_cpuperf_cap() * scx_bpf_cpuperf_cur() / %SCX_CPUPERF_ONE
+ *
+ * The result is in the range [1, %SCX_CPUPERF_ONE].
+ */
+__bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu)
+{
+	if (ops_cpu_valid(cpu, NULL))
+		return arch_scale_freq_capacity(cpu);
+	else
+		return SCX_CPUPERF_ONE;
+}
+
+/**
+ * scx_bpf_cpuperf_set - Set the relative performance target of a CPU
+ * @cpu: CPU of interest
+ * @perf: target performance level [0, %SCX_CPUPERF_ONE]
+ * @flags: %SCX_CPUPERF_* flags
+ *
+ * Set the target performance level of @cpu to @perf. @perf is in linear
+ * relative scale between 0 and %SCX_CPUPERF_ONE. This determines how the
+ * schedutil cpufreq governor chooses the target frequency.
+ *
+ * The actual performance level chosen, CPU grouping, and the overhead and
+ * latency of the operations are dependent on the hardware and cpufreq driver in
+ * use. Consult hardware and cpufreq documentation for more information. The
+ * current performance level can be monitored using scx_bpf_cpuperf_cur().
+ */
+__bpf_kfunc void scx_bpf_cpuperf_set(u32 cpu, u32 perf)
+{
+	if (unlikely(perf > SCX_CPUPERF_ONE)) {
+		scx_ops_error("Invalid cpuperf target %u for CPU %d", perf, cpu);
+		return;
+	}
+
+	if (ops_cpu_valid(cpu, NULL)) {
+		struct rq *rq = cpu_rq(cpu);
+
+		rq->scx.cpuperf_target = perf;
+
+		rcu_read_lock_sched_notrace();
+		cpufreq_update_util(cpu_rq(cpu), 0);
+		rcu_read_unlock_sched_notrace();
+	}
+}
+
 /**
  * scx_bpf_nr_cpu_ids - Return the number of possible CPU IDs
  *
@@ -6045,6 +6121,9 @@  BTF_ID_FLAGS(func, scx_bpf_destroy_dsq)
 BTF_ID_FLAGS(func, scx_bpf_exit_bstr, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_dump_bstr, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_set)
 BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids)
 BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE)
 BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE)
diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
index c41d742b5d62..c7ac33c47b68 100644
--- a/kernel/sched/ext.h
+++ b/kernel/sched/ext.h
@@ -48,6 +48,14 @@  int scx_check_setscheduler(struct task_struct *p, int policy);
 bool task_should_scx(struct task_struct *p);
 void init_sched_ext_class(void);
 
+static inline u32 scx_cpuperf_target(s32 cpu)
+{
+	if (scx_enabled())
+		return cpu_rq(cpu)->scx.cpuperf_target;
+	else
+		return 0;
+}
+
 static inline const struct sched_class *next_active_class(const struct sched_class *class)
 {
 	class++;
@@ -89,6 +97,7 @@  static inline void scx_pre_fork(struct task_struct *p) {}
 static inline int scx_fork(struct task_struct *p) { return 0; }
 static inline void scx_post_fork(struct task_struct *p) {}
 static inline void scx_cancel_fork(struct task_struct *p) {}
+static inline u32 scx_cpuperf_target(s32 cpu) { return 0; }
 static inline bool scx_can_stop_tick(struct rq *rq) { return true; }
 static inline void scx_rq_activate(struct rq *rq) {}
 static inline void scx_rq_deactivate(struct rq *rq) {}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c0d6e42c99cc..d3912cf3c3b7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -743,6 +743,7 @@  struct scx_rq {
 	u64			extra_enq_flags;	/* see move_task_to_local_dsq() */
 	u32			nr_running;
 	u32			flags;
+	u32			cpuperf_target;		/* [0, SCHED_CAPACITY_SCALE] */
 	bool			cpu_released;
 	cpumask_var_t		cpus_to_kick;
 	cpumask_var_t		cpus_to_kick_if_idle;
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 3fa87084cf17..dbbda0e35c5d 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -42,6 +42,9 @@  void scx_bpf_destroy_dsq(u64 dsq_id) __ksym;
 void scx_bpf_exit_bstr(s64 exit_code, char *fmt, unsigned long long *data, u32 data__sz) __ksym __weak;
 void scx_bpf_error_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym;
 void scx_bpf_dump_bstr(char *fmt, unsigned long long *data, u32 data_len) __ksym __weak;
+u32 scx_bpf_cpuperf_cap(s32 cpu) __ksym __weak;
+u32 scx_bpf_cpuperf_cur(s32 cpu) __ksym __weak;
+void scx_bpf_cpuperf_set(s32 cpu, u32 perf) __ksym __weak;
 u32 scx_bpf_nr_cpu_ids(void) __ksym __weak;
 const struct cpumask *scx_bpf_get_possible_cpumask(void) __ksym __weak;
 const struct cpumask *scx_bpf_get_online_cpumask(void) __ksym __weak;
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index c75c70d6a8eb..b1d0b09c966e 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -68,6 +68,18 @@  struct {
 	},
 };
 
+/*
+ * If enabled, CPU performance target is set according to the queue index
+ * according to the following table.
+ */
+static const u32 qidx_to_cpuperf_target[] = {
+	[0] = SCX_CPUPERF_ONE * 0 / 4,
+	[1] = SCX_CPUPERF_ONE * 1 / 4,
+	[2] = SCX_CPUPERF_ONE * 2 / 4,
+	[3] = SCX_CPUPERF_ONE * 3 / 4,
+	[4] = SCX_CPUPERF_ONE * 4 / 4,
+};
+
 /*
  * Per-queue sequence numbers to implement core-sched ordering.
  *
@@ -95,6 +107,8 @@  struct {
 struct cpu_ctx {
 	u64	dsp_idx;	/* dispatch index */
 	u64	dsp_cnt;	/* remaining count */
+	u32	avg_weight;
+	u32	cpuperf_target;
 };
 
 struct {
@@ -107,6 +121,8 @@  struct {
 /* Statistics */
 u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_dequeued;
 u64 nr_core_sched_execed;
+u32 cpuperf_min, cpuperf_avg, cpuperf_max;
+u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max;
 
 s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
 		   s32 prev_cpu, u64 wake_flags)
@@ -313,6 +329,29 @@  void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
 	}
 }
 
+void BPF_STRUCT_OPS(qmap_tick, struct task_struct *p)
+{
+	struct cpu_ctx *cpuc;
+	u32 zero = 0;
+	int idx;
+
+	if (!(cpuc = bpf_map_lookup_elem(&cpu_ctx_stor, &zero))) {
+		scx_bpf_error("failed to look up cpu_ctx");
+		return;
+	}
+
+	/*
+	 * Use the running avg of weights to select the target cpuperf level.
+	 * This is a demonstration of the cpuperf feature rather than a
+	 * practical strategy to regulate CPU frequency.
+	 */
+	cpuc->avg_weight = cpuc->avg_weight * 3 / 4 + p->scx.weight / 4;
+	idx = weight_to_idx(cpuc->avg_weight);
+	cpuc->cpuperf_target = qidx_to_cpuperf_target[idx];
+
+	scx_bpf_cpuperf_set(scx_bpf_task_cpu(p), cpuc->cpuperf_target);
+}
+
 /*
  * The distance from the head of the queue scaled by the weight of the queue.
  * The lower the number, the older the task and the higher the priority.
@@ -422,8 +461,9 @@  void BPF_STRUCT_OPS(qmap_dump_cpu, struct scx_dump_ctx *dctx, s32 cpu, bool idle
 	if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, cpu)))
 		return;
 
-	scx_bpf_dump("QMAP: dsp_idx=%llu dsp_cnt=%llu",
-		     cpuc->dsp_idx, cpuc->dsp_cnt);
+	scx_bpf_dump("QMAP: dsp_idx=%llu dsp_cnt=%llu avg_weight=%u cpuperf_target=%u",
+		     cpuc->dsp_idx, cpuc->dsp_cnt, cpuc->avg_weight,
+		     cpuc->cpuperf_target);
 }
 
 void BPF_STRUCT_OPS(qmap_dump_task, struct scx_dump_ctx *dctx, struct task_struct *p)
@@ -492,11 +532,106 @@  void BPF_STRUCT_OPS(qmap_cpu_offline, s32 cpu)
 	print_cpus();
 }
 
+struct monitor_timer {
+	struct bpf_timer timer;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, u32);
+	__type(value, struct monitor_timer);
+} monitor_timer SEC(".maps");
+
+/*
+ * Print out the min, avg and max performance levels of CPUs every second to
+ * demonstrate the cpuperf interface.
+ */
+static void monitor_cpuperf(void)
+{
+	u32 zero = 0, nr_cpu_ids;
+	u64 cap_sum = 0, cur_sum = 0, cur_min = SCX_CPUPERF_ONE, cur_max = 0;
+	u64 target_sum = 0, target_min = SCX_CPUPERF_ONE, target_max = 0;
+	const struct cpumask *online;
+	int i, nr_online_cpus = 0;
+
+	nr_cpu_ids = scx_bpf_nr_cpu_ids();
+	online = scx_bpf_get_online_cpumask();
+
+	bpf_for(i, 0, nr_cpu_ids) {
+		struct cpu_ctx *cpuc;
+		u32 cap, cur;
+
+		if (!bpf_cpumask_test_cpu(i, online))
+			continue;
+		nr_online_cpus++;
+
+		/* collect the capacity and current cpuperf */
+		cap = scx_bpf_cpuperf_cap(i);
+		cur = scx_bpf_cpuperf_cur(i);
+
+		cur_min = cur < cur_min ? cur : cur_min;
+		cur_max = cur > cur_max ? cur : cur_max;
+
+		/*
+		 * $cur is relative to $cap. Scale it down accordingly so that
+		 * it's in the same scale as other CPUs and $cur_sum/$cap_sum
+		 * makes sense.
+		 */
+		cur_sum += cur * cap / SCX_CPUPERF_ONE;
+		cap_sum += cap;
+
+		if (!(cpuc = bpf_map_lookup_percpu_elem(&cpu_ctx_stor, &zero, i))) {
+			scx_bpf_error("failed to look up cpu_ctx");
+			goto out;
+		}
+
+		/* collect target */
+		cur = cpuc->cpuperf_target;
+		target_sum += cur;
+		target_min = cur < target_min ? cur : target_min;
+		target_max = cur > target_max ? cur : target_max;
+	}
+
+	cpuperf_min = cur_min;
+	cpuperf_avg = cur_sum * SCX_CPUPERF_ONE / cap_sum;
+	cpuperf_max = cur_max;
+
+	cpuperf_target_min = target_min;
+	cpuperf_target_avg = target_sum / nr_online_cpus;
+	cpuperf_target_max = target_max;
+out:
+	scx_bpf_put_cpumask(online);
+}
+
+static int monitor_timerfn(void *map, int *key, struct bpf_timer *timer)
+{
+	monitor_cpuperf();
+
+	bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
+	return 0;
+}
+
 s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
 {
+	u32 key = 0;
+	struct bpf_timer *timer;
+	s32 ret;
+
 	print_cpus();
 
-	return scx_bpf_create_dsq(SHARED_DSQ, -1);
+	ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
+	if (ret)
+		return ret;
+
+	timer = bpf_map_lookup_elem(&monitor_timer, &key);
+	if (!timer)
+		return -ESRCH;
+
+	bpf_timer_init(timer, &monitor_timer, CLOCK_MONOTONIC);
+	bpf_timer_set_callback(timer, monitor_timerfn);
+
+	return bpf_timer_start(timer, ONE_SEC_IN_NS, 0);
 }
 
 void BPF_STRUCT_OPS(qmap_exit, struct scx_exit_info *ei)
@@ -509,6 +644,7 @@  SCX_OPS_DEFINE(qmap_ops,
 	       .enqueue			= (void *)qmap_enqueue,
 	       .dequeue			= (void *)qmap_dequeue,
 	       .dispatch		= (void *)qmap_dispatch,
+	       .tick			= (void *)qmap_tick,
 	       .core_sched_before	= (void *)qmap_core_sched_before,
 	       .cpu_release		= (void *)qmap_cpu_release,
 	       .init_task		= (void *)qmap_init_task,
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index bc36ec4f88a7..4d41c0cb1dab 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -116,6 +116,14 @@  int main(int argc, char **argv)
 		       nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
 		       skel->bss->nr_reenqueued, skel->bss->nr_dequeued,
 		       skel->bss->nr_core_sched_execed);
+		if (__COMPAT_has_ksym("scx_bpf_cpuperf_cur"))
+			printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n",
+			       skel->bss->cpuperf_min,
+			       skel->bss->cpuperf_avg,
+			       skel->bss->cpuperf_max,
+			       skel->bss->cpuperf_target_min,
+			       skel->bss->cpuperf_target_avg,
+			       skel->bss->cpuperf_target_max);
 		fflush(stdout);
 		sleep(1);
 	}