| Field | Value |
|---|---|
| Message ID | 20240501151312.635565-11-tj@kernel.org (mailing list archive) |
| State | Not Applicable |
| Series | [01/39] cgroup: Implement cgroup_show_cftypes() |
On Wed, May 01, 2024 at 05:09:45AM -1000, Tejun Heo wrote:
> RT, DL, thermal and irq load and utilization metrics need to be decayed and
> updated periodically and before consumption to keep the numbers reasonable.
> This is currently done from __update_blocked_others() as a part of the fair
> class load balance path. Let's factor it out to update_other_load_avgs().
> Pure refactor. No functional changes.
>
> This will be used by the new BPF extensible scheduling class to ensure that
> the above metrics are properly maintained.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: David Vernet <dvernet@meta.com>
> ---
>  kernel/sched/core.c  | 19 +++++++++++++++++++
>  kernel/sched/fair.c  | 16 +++-------------
>  kernel/sched/sched.h |  3 +++
>  3 files changed, 25 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 90b505fbb488..7542a39f1fde 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7486,6 +7486,25 @@ int sched_core_idle_cpu(int cpu)
>  #endif
>
>  #ifdef CONFIG_SMP
> +/*
> + * Load avg and utiliztion metrics need to be updated periodically and before
> + * consumption. This function updates the metrics for all subsystems except for
> + * the fair class. @rq must be locked and have its clock updated.
> + */
> +bool update_other_load_avgs(struct rq *rq)
> +{
> +	u64 now = rq_clock_pelt(rq);
> +	const struct sched_class *curr_class = rq->curr->sched_class;
> +	unsigned long thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	return update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
> +		update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
> +		update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure) |
> +		update_irq_load_avg(rq, 0);
> +}

Yeah, but you then ignore the return value and don't call into cpufreq.

Vincent, what would be the right thing to do here?
On Mon, 24 Jun 2024 at 14:35, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, May 01, 2024 at 05:09:45AM -1000, Tejun Heo wrote:
> > RT, DL, thermal and irq load and utilization metrics need to be decayed and
> > updated periodically and before consumption to keep the numbers reasonable.
> > This is currently done from __update_blocked_others() as a part of the fair
> > class load balance path. Let's factor it out to update_other_load_avgs().
> > Pure refactor. No functional changes.
> >
> > This will be used by the new BPF extensible scheduling class to ensure that
> > the above metrics are properly maintained.
> >
> > Signed-off-by: Tejun Heo <tj@kernel.org>
> > Reviewed-by: David Vernet <dvernet@meta.com>
> > ---
> >  kernel/sched/core.c  | 19 +++++++++++++++++++
> >  kernel/sched/fair.c  | 16 +++-------------
> >  kernel/sched/sched.h |  3 +++
> >  3 files changed, 25 insertions(+), 13 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 90b505fbb488..7542a39f1fde 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -7486,6 +7486,25 @@ int sched_core_idle_cpu(int cpu)
> >  #endif
> >
> >  #ifdef CONFIG_SMP
> > +/*
> > + * Load avg and utiliztion metrics need to be updated periodically and before
> > + * consumption. This function updates the metrics for all subsystems except for
> > + * the fair class. @rq must be locked and have its clock updated.
> > + */
> > +bool update_other_load_avgs(struct rq *rq)
> > +{
> > +	u64 now = rq_clock_pelt(rq);
> > +	const struct sched_class *curr_class = rq->curr->sched_class;
> > +	unsigned long thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
> > +
> > +	lockdep_assert_rq_held(rq);
> > +
> > +	return update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
> > +		update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
> > +		update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure) |
> > +		update_irq_load_avg(rq, 0);
> > +}
>
> Yeah, but you then ignore the return value and don't call into cpufreq.
>
> Vincent, what would be the right thing to do here?

These metrics are only consumed by the fair class, so my first question would be:

- Do we plan to have fair and sched_ext coexist? Or is it exclusive? I
  haven't been able to get a clear answer on this while reading the
  cover letter.

- If sched_ext is exclusive to fair, then I'm not sure that you need to
  update them at all because they will not be used. RT uses a fixed freq
  and DL uses the sum of running DL bandwidth. But if both can coexist,
  we must be sure to update them periodically.
Hello, Peter, Vincent.

On Mon, Jun 24, 2024 at 06:15:20PM +0200, Vincent Guittot wrote:
> > > +bool update_other_load_avgs(struct rq *rq)
> > > +{
> > > +	u64 now = rq_clock_pelt(rq);
> > > +	const struct sched_class *curr_class = rq->curr->sched_class;
> > > +	unsigned long thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
> > > +
> > > +	lockdep_assert_rq_held(rq);
> > > +
> > > +	return update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
> > > +		update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
> > > +		update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure) |
> > > +		update_irq_load_avg(rq, 0);
> > > +}
> >
> > Yeah, but you then ignore the return value and don't call into cpufreq.
> >
> > Vincent, what would be the right thing to do here?
>
> These metrics are only consumed by the fair class, so my first question would be:
>
> - Do we plan to have fair and sched_ext coexist? Or is it exclusive? I
>   haven't been able to get a clear answer on this while reading the
>   cover letter.

Oh, there are two modes. In partial mode (SCX_OPS_SWITCH_PARTIAL), there can
be both CFS and SCX tasks running on the same system. If the flag is not set,
the whole system is on SCX.

> - If sched_ext is exclusive to fair, then I'm not sure that you need to
>   update them at all because they will not be used. RT uses a fixed freq
>   and DL uses the sum of running DL bandwidth. But if both can coexist,
>   we must be sure to update them periodically.

Hmm.... I think I saw RT's schedutil signal stuck high, constantly pushing up
the frequency. I might be mistaken tho. I'll check again.

Thanks.
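To illustrate the two modes Tejun describes, here is a rough sketch of the difference. The helper name and the exact policy checks are assumptions made for illustration only; this is not the actual sched_ext code.

/*
 * Rough illustration only (not the actual sched_ext code). With
 * SCX_OPS_SWITCH_PARTIAL set, only tasks explicitly switched to the
 * SCHED_EXT policy run on the BPF scheduler while CFS keeps its
 * SCHED_NORMAL tasks; without the flag, SCHED_NORMAL tasks are routed
 * to sched_ext as well.
 */
static bool task_runs_on_scx(const struct task_struct *p, u64 ops_flags)
{
	if (p->policy == SCHED_EXT)
		return true;

	return !(ops_flags & SCX_OPS_SWITCH_PARTIAL) && p->policy == SCHED_NORMAL;
}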
On Mon, 24 Jun 2024 at 21:24, Tejun Heo <tj@kernel.org> wrote:
>
> Hello, Peter, Vincent.
>
> On Mon, Jun 24, 2024 at 06:15:20PM +0200, Vincent Guittot wrote:
> > > > +bool update_other_load_avgs(struct rq *rq)
> > > > +{
> > > > +	u64 now = rq_clock_pelt(rq);
> > > > +	const struct sched_class *curr_class = rq->curr->sched_class;
> > > > +	unsigned long thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
> > > > +
> > > > +	lockdep_assert_rq_held(rq);
> > > > +
> > > > +	return update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
> > > > +		update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
> > > > +		update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure) |
> > > > +		update_irq_load_avg(rq, 0);
> > > > +}
> > >
> > > Yeah, but you then ignore the return value and don't call into cpufreq.
> > >
> > > Vincent, what would be the right thing to do here?
> >
> > These metrics are only consumed by the fair class, so my first question would be:
> >
> > - Do we plan to have fair and sched_ext coexist? Or is it exclusive? I
> >   haven't been able to get a clear answer on this while reading the
> >   cover letter.
>
> Oh, there are two modes. In partial mode (SCX_OPS_SWITCH_PARTIAL), there can
> be both CFS and SCX tasks running on the same system. If the flag is not set,
> the whole system is on SCX.

Ok, thanks for the clarification, because this will have an impact on how
fair tasks are scheduled.

> > - If sched_ext is exclusive to fair, then I'm not sure that you need to
> >   update them at all because they will not be used. RT uses a fixed freq
> >   and DL uses the sum of running DL bandwidth. But if both can coexist,
> >   we must be sure to update them periodically.
>
> Hmm.... I think I saw RT's schedutil signal stuck high, constantly pushing up
> the frequency. I might be mistaken tho. I'll check again.

This is used when selecting a frequency for fair tasks.

> Thanks.
>
> --
> tejun
Hello,

On Tue, Jun 25, 2024 at 11:13:54AM +0200, Vincent Guittot wrote:
> > Hmm.... I think I saw RT's schedutil signal stuck high, constantly pushing up
> > the frequency. I might be mistaken tho. I'll check again.
>
> This is used when selecting a frequency for fair tasks.

When schedutil is used as the governor, sugov_get_util() provides the source
utilization information to determine the target frequency. sugov_get_util()
gets the CFS util metric from cpu_util_cfs_boost() and then runs it through
effective_cpu_util(). effective_cpu_util() does a bunch of things, including
adding cpu_util_irq(), cpu_util_rt() and cpu_util_dl(), so if SCX doesn't
decay these utilization metrics, they never decay and schedutil ends up making
decisions with stale, stuck-high numbers. I can easily confirm the behavior by
sprinkling some trace_printks and commenting out update_other_load_avgs() on
the SCX side.

Thanks.
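To make the dependency concrete, the sketch below restates that aggregation in heavily simplified form. The real effective_cpu_util() also scales for IRQ time, applies uclamp and headroom, and fills the min/max outputs used by schedutil, so treat this as an illustration of the idea rather than the kernel's implementation.

/*
 * Heavily simplified sketch of the path described above, not the kernel's
 * actual effective_cpu_util(). It only shows which PELT signals feed the
 * schedutil frequency request.
 */
static unsigned long effective_cpu_util_sketch(struct rq *rq, unsigned long util_cfs)
{
	unsigned long util = util_cfs;

	/*
	 * These are the signals that update_other_load_avgs() decays. If
	 * nothing calls it while SCX is active, the values stay frozen at
	 * their last level and keep inflating the frequency request.
	 */
	util += cpu_util_rt(rq);
	util += cpu_util_dl(rq);
	util += cpu_util_irq(rq);

	return min(util, arch_scale_cpu_capacity(cpu_of(rq)));
}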
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 90b505fbb488..7542a39f1fde 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7486,6 +7486,25 @@ int sched_core_idle_cpu(int cpu)
 #endif
 
 #ifdef CONFIG_SMP
+/*
+ * Load avg and utiliztion metrics need to be updated periodically and before
+ * consumption. This function updates the metrics for all subsystems except for
+ * the fair class. @rq must be locked and have its clock updated.
+ */
+bool update_other_load_avgs(struct rq *rq)
+{
+	u64 now = rq_clock_pelt(rq);
+	const struct sched_class *curr_class = rq->curr->sched_class;
+	unsigned long thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
+
+	lockdep_assert_rq_held(rq);
+
+	return update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
+		update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
+		update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure) |
+		update_irq_load_avg(rq, 0);
+}
+
 /*
  * This function computes an effective utilization for the given CPU, to be
  * used for frequency selection given the linear relation: f = u * f_max.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8032256d3972..51301ae13725 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9283,28 +9283,18 @@ static inline void update_blocked_load_status(struct rq *rq, bool has_blocked) {
 
 static bool __update_blocked_others(struct rq *rq, bool *done)
 {
-	const struct sched_class *curr_class;
-	u64 now = rq_clock_pelt(rq);
-	unsigned long thermal_pressure;
-	bool decayed;
+	bool updated;
 
 	/*
 	 * update_load_avg() can call cpufreq_update_util(). Make sure that RT,
 	 * DL and IRQ signals have been updated before updating CFS.
 	 */
-	curr_class = rq->curr->sched_class;
-
-	thermal_pressure = arch_scale_thermal_pressure(cpu_of(rq));
-
-	decayed = update_rt_rq_load_avg(now, rq, curr_class == &rt_sched_class) |
-		update_dl_rq_load_avg(now, rq, curr_class == &dl_sched_class) |
-		update_thermal_load_avg(rq_clock_thermal(rq), rq, thermal_pressure) |
-		update_irq_load_avg(rq, 0);
+	updated = update_other_load_avgs(rq);
 
 	if (others_have_blocked(rq))
 		*done = false;
 
-	return decayed;
+	return updated;
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bcc8056acadb..ccf2fff0e2ae 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3042,6 +3042,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 #endif
 
 #ifdef CONFIG_SMP
+bool update_other_load_avgs(struct rq *rq);
 unsigned long effective_cpu_util(int cpu, unsigned long util_cfs,
 				 unsigned long *min,
 				 unsigned long *max);
@@ -3084,6 +3085,8 @@ static inline unsigned long cpu_util_rt(struct rq *rq)
 {
 	return READ_ONCE(rq->avg_rt.util_avg);
 }
+#else
+static inline bool update_other_load_avgs(struct rq *rq) { return false; }
 #endif
 
 #ifdef CONFIG_UCLAMP_TASK
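For reference, Peter's point about the dropped return value could be addressed on the sched_ext side along the lines of the hypothetical sketch below. The wrapper name is made up for illustration and this is not part of the posted patch.

/*
 * Hypothetical caller-side sketch, not part of the posted patch: consume the
 * return value of update_other_load_avgs() and poke cpufreq only when one of
 * the RT/DL/thermal/IRQ signals actually changed.
 */
static void scx_update_other_load_avgs(struct rq *rq)
{
	if (update_other_load_avgs(rq))
		cpufreq_update_util(rq, 0);
}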