diff mbox series

[2/2] cpufreq: schedutil: Ignore rate limit when scaling up with FIE present

Message ID 20241212015734.41241-1-sultan@kerneltoast.com (mailing list archive)
State New
Headers show
Series [1/2] cpufreq: schedutil: Fix superfluous updates caused by need_freq_update | expand

Commit Message

Sultan Alsawaf Dec. 12, 2024, 1:57 a.m. UTC
From: "Sultan Alsawaf (unemployed)" <sultan@kerneltoast.com>

When schedutil disregards a frequency transition due to the transition rate
limit, there is no guaranteed deadline as to when the frequency transition
will actually occur after the rate limit expires. For instance, depending
on how long a CPU spends in a preempt/IRQs disabled context, a rate-limited
frequency transition may be delayed indefinitely, until said CPU reaches
the scheduler again. This also hurts tasks boosted via UCLAMP_MIN.

For frequency transitions _down_, this only poses a theoretical loss of
energy savings since a CPU may remain at a higher frequency than necessary
for an indefinite period beyond the rate limit expiry.

For frequency transitions _up_, however, this poses a significant hit to
performance when a CPU is stuck at an insufficient frequency for an
indefinitely long time. In latency-sensitive and bursty workloads
especially, a missed frequency transition up can result in a significant
performance loss due to a CPU operating at an insufficient frequency for
too long.

When support for the Frequency Invariant Engine (FIE) _isn't_ present, a
rate limit is always required for the scheduler to compute CPU utilization
with some semblance of accuracy: any frequency transition that occurs
before the previous transition latches would result in the scheduler not
knowing the frequency a CPU is actually operating at, thereby trashing the
computed CPU utilization.

However, when FIE support _is_ present, there's no technical requirement to
rate limit all frequency transitions to a cpufreq driver's reported
transition latency. With FIE, the scheduler's CPU utilization tracking is
unaffected by any frequency transitions that occur before the previous
frequency is latched.

Therefore, ignore the frequency transition rate limit when scaling up on
systems where FIE is present. This guarantees that transitions to a higher
frequency cannot be indefinitely delayed, since they simply cannot be
delayed at all.

Signed-off-by: Sultan Alsawaf (unemployed) <sultan@kerneltoast.com>
---
 kernel/sched/cpufreq_schedutil.c | 35 +++++++++++++++++++++++++++-----
 1 file changed, 30 insertions(+), 5 deletions(-)
diff mbox series

Patch

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index e51d5ce730be..563baab89a24 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -59,10 +59,15 @@  static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
 
 /************************ Governor internals ***********************/
 
-static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
+static bool sugov_should_rate_limit(struct sugov_policy *sg_policy, u64 time)
 {
-	s64 delta_ns;
+	s64 delta_ns = time - sg_policy->last_freq_update_time;
+
+	return delta_ns < sg_policy->freq_update_delay_ns;
+}
 
+static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
+{
 	/*
 	 * Since cpufreq_update_util() is called with rq->lock held for
 	 * the @target_cpu, our per-CPU data is fully serialized.
@@ -87,17 +92,37 @@  static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
 		return true;
 	}
 
-	delta_ns = time - sg_policy->last_freq_update_time;
+	/*
+	 * When frequency-invariant utilization tracking is present, there's no
+	 * rate limit when increasing frequency. Therefore, the next frequency
+	 * must be determined before a decision can be made to rate limit the
+	 * frequency change, hence the rate limit check is bypassed here.
+	 */
+	if (arch_scale_freq_invariant())
+		return true;
 
-	return delta_ns >= sg_policy->freq_update_delay_ns;
+	return !sugov_should_rate_limit(sg_policy, time);
 }
 
 static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
 				   unsigned int next_freq)
 {
+	/*
+	 * When a frequency update isn't mandatory (!need_freq_update), the rate
+	 * limit is checked again upon frequency reduction because systems with
+	 * frequency-invariant utilization bypass the rate limit check entirely
+	 * in sugov_should_update_freq(). This is done so that the rate limit
+	 * can be applied only for frequency reduction when frequency-invariant
+	 * utilization is present. Now that the next frequency is known, the
+	 * rate limit can be selectively applied to frequency reduction on such
+	 * systems. A check for arch_scale_freq_invariant() is omitted here
+	 * because unconditionally rechecking the rate limit is cheaper.
+	 */
 	if (sg_policy->need_freq_update)
 		sg_policy->need_freq_update = false;
-	else if (sg_policy->next_freq == next_freq)
+	else if (next_freq == sg_policy->next_freq ||
+		 (next_freq < sg_policy->next_freq &&
+		  sugov_should_rate_limit(sg_policy, time)))
 		return false;
 
 	sg_policy->next_freq = next_freq;