Message ID | 20250213035510.2402076-1-zhanjie9@hisilicon.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v3] cpufreq: governor: Fix negative 'idle_time' handling in dbs_update() | expand |
A kindly ping on this patch. Thanks, Jie On 13/02/2025 11:55, Jie Zhan wrote: > We observed an issue that the cpu frequency can't raise up with a 100% cpu > load when NOHZ is off and the 'conservative' governor is selected. > > 'idle_time' can be negative if it's obtained from get_cpu_idle_time_jiffy() > when NOHZ is off. This was found and explained in commit 9485e4ca0b48 > ("cpufreq: governor: Fix handling of special cases in dbs_update()"). > > However, commit 7592019634f8 ("cpufreq: governors: Fix long idle detection > logic in load calculation") introduced a comparison between 'idle_time' and > 'samling_rate' to detect a long idle interval. While 'idle_time' is > converted to int before comparison, it's actually promoted to unsigned > again when compared with an unsigned 'sampling_rate'. Hence, this leads to > wrong idle interval detection when it's in fact 100% busy and sets > policy_dbs->idle_periods to a very large value. 'conservative' adjusts the > frequency to minimum because of the large 'idle_periods', such that the > frequency can't raise up. 'Ondemand' doesn't use policy_dbs->idle_periods > so it fortunately avoids the issue. > > Correct negative 'idle_time' to 0 before any use of it in dbs_update(). > > Fixes: 7592019634f8 ("cpufreq: governors: Fix long idle detection logic in load calculation") > Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com> > Reviewed-by: Chen Yu <yu.c.chen@intel.com> > --- > v3: > - Remove ternary operators. > > v2: > - Avoid type conversion, compare current and previous idle time before > obtaining 'idle_time'. > - Update the explanation in comments. > > Discussions: > v2: https://lore.kernel.org/linux-pm/20250212081438.1294503-1-zhanjie9@hisilicon.com/ > v1: https://lore.kernel.org/linux-pm/20250210130659.3533182-1-zhanjie9@hisilicon.com/ > --- > drivers/cpufreq/cpufreq_governor.c | 45 +++++++++++++++--------------- > 1 file changed, 23 insertions(+), 22 deletions(-) > > diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c > index af44ee6a6430..1a7fcaf39cc9 100644 > --- a/drivers/cpufreq/cpufreq_governor.c > +++ b/drivers/cpufreq/cpufreq_governor.c > @@ -145,7 +145,23 @@ unsigned int dbs_update(struct cpufreq_policy *policy) > time_elapsed = update_time - j_cdbs->prev_update_time; > j_cdbs->prev_update_time = update_time; > > - idle_time = cur_idle_time - j_cdbs->prev_cpu_idle; > + /* > + * cur_idle_time could be smaller than j_cdbs->prev_cpu_idle if > + * it's obtained from get_cpu_idle_time_jiffy() when NOHZ is > + * off, where idle_time is calculated by the difference between > + * time elapsed in jiffies and "busy time" obtained from CPU > + * statistics. If a CPU is 100% busy, the time elapsed and busy > + * time should grow with the same amount in two consecutive > + * samples, but in practice there could be a tiny difference, > + * making the accumulated idle time decrease sometimes. Hence, > + * in this case, idle_time should be regarded as 0 in order to > + * make the further process correct. > + */ > + if (cur_idle_time > j_cdbs->prev_cpu_idle) > + idle_time = cur_idle_time - j_cdbs->prev_cpu_idle; > + else > + idle_time = 0; > + > j_cdbs->prev_cpu_idle = cur_idle_time; > > if (ignore_nice) { > @@ -162,7 +178,7 @@ unsigned int dbs_update(struct cpufreq_policy *policy) > * calls, so the previous load value can be used then. > */ > load = j_cdbs->prev_load; > - } else if (unlikely((int)idle_time > 2 * sampling_rate && > + } else if (unlikely(idle_time > 2 * sampling_rate && > j_cdbs->prev_load)) { > /* > * If the CPU had gone completely idle and a task has > @@ -189,30 +205,15 @@ unsigned int dbs_update(struct cpufreq_policy *policy) > load = j_cdbs->prev_load; > j_cdbs->prev_load = 0; > } else { > - if (time_elapsed >= idle_time) { > + if (time_elapsed > idle_time) > load = 100 * (time_elapsed - idle_time) / time_elapsed; > - } else { > - /* > - * That can happen if idle_time is returned by > - * get_cpu_idle_time_jiffy(). In that case > - * idle_time is roughly equal to the difference > - * between time_elapsed and "busy time" obtained > - * from CPU statistics. Then, the "busy time" > - * can end up being greater than time_elapsed > - * (for example, if jiffies_64 and the CPU > - * statistics are updated by different CPUs), > - * so idle_time may in fact be negative. That > - * means, though, that the CPU was busy all > - * the time (on the rough average) during the > - * last sampling interval and 100 can be > - * returned as the load. > - */ > - load = (int)idle_time < 0 ? 100 : 0; > - } > + else > + load = 0; > + > j_cdbs->prev_load = load; > } > > - if (unlikely((int)idle_time > 2 * sampling_rate)) { > + if (unlikely(idle_time > 2 * sampling_rate)) { > unsigned int periods = idle_time / sampling_rate; > > if (periods < idle_periods)
On Tue, Feb 18, 2025 at 1:41 PM Jie Zhan <zhanjie9@hisilicon.com> wrote: > > A kindly ping on this patch. I'll get to it in the next few days, I don't think it is urgent. Thanks! > On 13/02/2025 11:55, Jie Zhan wrote: > > We observed an issue that the cpu frequency can't raise up with a 100% cpu > > load when NOHZ is off and the 'conservative' governor is selected. > > > > 'idle_time' can be negative if it's obtained from get_cpu_idle_time_jiffy() > > when NOHZ is off. This was found and explained in commit 9485e4ca0b48 > > ("cpufreq: governor: Fix handling of special cases in dbs_update()"). > > > > However, commit 7592019634f8 ("cpufreq: governors: Fix long idle detection > > logic in load calculation") introduced a comparison between 'idle_time' and > > 'samling_rate' to detect a long idle interval. While 'idle_time' is > > converted to int before comparison, it's actually promoted to unsigned > > again when compared with an unsigned 'sampling_rate'. Hence, this leads to > > wrong idle interval detection when it's in fact 100% busy and sets > > policy_dbs->idle_periods to a very large value. 'conservative' adjusts the > > frequency to minimum because of the large 'idle_periods', such that the > > frequency can't raise up. 'Ondemand' doesn't use policy_dbs->idle_periods > > so it fortunately avoids the issue. > > > > Correct negative 'idle_time' to 0 before any use of it in dbs_update(). > > > > Fixes: 7592019634f8 ("cpufreq: governors: Fix long idle detection logic in load calculation") > > Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com> > > Reviewed-by: Chen Yu <yu.c.chen@intel.com> > > --- > > v3: > > - Remove ternary operators. > > > > v2: > > - Avoid type conversion, compare current and previous idle time before > > obtaining 'idle_time'. > > - Update the explanation in comments. > > > > Discussions: > > v2: https://lore.kernel.org/linux-pm/20250212081438.1294503-1-zhanjie9@hisilicon.com/ > > v1: https://lore.kernel.org/linux-pm/20250210130659.3533182-1-zhanjie9@hisilicon.com/ > > --- > > drivers/cpufreq/cpufreq_governor.c | 45 +++++++++++++++--------------- > > 1 file changed, 23 insertions(+), 22 deletions(-) > > > > diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c > > index af44ee6a6430..1a7fcaf39cc9 100644 > > --- a/drivers/cpufreq/cpufreq_governor.c > > +++ b/drivers/cpufreq/cpufreq_governor.c > > @@ -145,7 +145,23 @@ unsigned int dbs_update(struct cpufreq_policy *policy) > > time_elapsed = update_time - j_cdbs->prev_update_time; > > j_cdbs->prev_update_time = update_time; > > > > - idle_time = cur_idle_time - j_cdbs->prev_cpu_idle; > > + /* > > + * cur_idle_time could be smaller than j_cdbs->prev_cpu_idle if > > + * it's obtained from get_cpu_idle_time_jiffy() when NOHZ is > > + * off, where idle_time is calculated by the difference between > > + * time elapsed in jiffies and "busy time" obtained from CPU > > + * statistics. If a CPU is 100% busy, the time elapsed and busy > > + * time should grow with the same amount in two consecutive > > + * samples, but in practice there could be a tiny difference, > > + * making the accumulated idle time decrease sometimes. Hence, > > + * in this case, idle_time should be regarded as 0 in order to > > + * make the further process correct. > > + */ > > + if (cur_idle_time > j_cdbs->prev_cpu_idle) > > + idle_time = cur_idle_time - j_cdbs->prev_cpu_idle; > > + else > > + idle_time = 0; > > + > > j_cdbs->prev_cpu_idle = cur_idle_time; > > > > if (ignore_nice) { > > @@ -162,7 +178,7 @@ unsigned int dbs_update(struct cpufreq_policy *policy) > > * calls, so the previous load value can be used then. > > */ > > load = j_cdbs->prev_load; > > - } else if (unlikely((int)idle_time > 2 * sampling_rate && > > + } else if (unlikely(idle_time > 2 * sampling_rate && > > j_cdbs->prev_load)) { > > /* > > * If the CPU had gone completely idle and a task has > > @@ -189,30 +205,15 @@ unsigned int dbs_update(struct cpufreq_policy *policy) > > load = j_cdbs->prev_load; > > j_cdbs->prev_load = 0; > > } else { > > - if (time_elapsed >= idle_time) { > > + if (time_elapsed > idle_time) > > load = 100 * (time_elapsed - idle_time) / time_elapsed; > > - } else { > > - /* > > - * That can happen if idle_time is returned by > > - * get_cpu_idle_time_jiffy(). In that case > > - * idle_time is roughly equal to the difference > > - * between time_elapsed and "busy time" obtained > > - * from CPU statistics. Then, the "busy time" > > - * can end up being greater than time_elapsed > > - * (for example, if jiffies_64 and the CPU > > - * statistics are updated by different CPUs), > > - * so idle_time may in fact be negative. That > > - * means, though, that the CPU was busy all > > - * the time (on the rough average) during the > > - * last sampling interval and 100 can be > > - * returned as the load. > > - */ > > - load = (int)idle_time < 0 ? 100 : 0; > > - } > > + else > > + load = 0; > > + > > j_cdbs->prev_load = load; > > } > > > > - if (unlikely((int)idle_time > 2 * sampling_rate)) { > > + if (unlikely(idle_time > 2 * sampling_rate)) { > > unsigned int periods = idle_time / sampling_rate; > > > > if (periods < idle_periods)
Yeah, no problem! Just in case it's missed. Thanks. On 18/02/2025 20:43, Rafael J. Wysocki wrote: > On Tue, Feb 18, 2025 at 1:41 PM Jie Zhan <zhanjie9@hisilicon.com> wrote: >> >> A kindly ping on this patch. > > I'll get to it in the next few days, I don't think it is urgent. > > Thanks! > >> On 13/02/2025 11:55, Jie Zhan wrote: >>> We observed an issue that the cpu frequency can't raise up with a 100% cpu >>> load when NOHZ is off and the 'conservative' governor is selected. >>> >>> 'idle_time' can be negative if it's obtained from get_cpu_idle_time_jiffy() >>> when NOHZ is off. This was found and explained in commit 9485e4ca0b48 >>> ("cpufreq: governor: Fix handling of special cases in dbs_update()"). >>> >>> However, commit 7592019634f8 ("cpufreq: governors: Fix long idle detection >>> logic in load calculation") introduced a comparison between 'idle_time' and >>> 'samling_rate' to detect a long idle interval. While 'idle_time' is >>> converted to int before comparison, it's actually promoted to unsigned >>> again when compared with an unsigned 'sampling_rate'. Hence, this leads to >>> wrong idle interval detection when it's in fact 100% busy and sets >>> policy_dbs->idle_periods to a very large value. 'conservative' adjusts the >>> frequency to minimum because of the large 'idle_periods', such that the >>> frequency can't raise up. 'Ondemand' doesn't use policy_dbs->idle_periods >>> so it fortunately avoids the issue. >>> >>> Correct negative 'idle_time' to 0 before any use of it in dbs_update(). >>> >>> Fixes: 7592019634f8 ("cpufreq: governors: Fix long idle detection logic in load calculation") >>> Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com> >>> Reviewed-by: Chen Yu <yu.c.chen@intel.com> >>> --- >>> v3: >>> - Remove ternary operators. >>> >>> v2: >>> - Avoid type conversion, compare current and previous idle time before >>> obtaining 'idle_time'. >>> - Update the explanation in comments. >>> >>> Discussions: >>> v2: https://lore.kernel.org/linux-pm/20250212081438.1294503-1-zhanjie9@hisilicon.com/ >>> v1: https://lore.kernel.org/linux-pm/20250210130659.3533182-1-zhanjie9@hisilicon.com/ >>> --- >>> drivers/cpufreq/cpufreq_governor.c | 45 +++++++++++++++--------------- >>> 1 file changed, 23 insertions(+), 22 deletions(-) >>> >>> diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c >>> index af44ee6a6430..1a7fcaf39cc9 100644 >>> --- a/drivers/cpufreq/cpufreq_governor.c >>> +++ b/drivers/cpufreq/cpufreq_governor.c >>> @@ -145,7 +145,23 @@ unsigned int dbs_update(struct cpufreq_policy *policy) >>> time_elapsed = update_time - j_cdbs->prev_update_time; >>> j_cdbs->prev_update_time = update_time; >>> >>> - idle_time = cur_idle_time - j_cdbs->prev_cpu_idle; >>> + /* >>> + * cur_idle_time could be smaller than j_cdbs->prev_cpu_idle if >>> + * it's obtained from get_cpu_idle_time_jiffy() when NOHZ is >>> + * off, where idle_time is calculated by the difference between >>> + * time elapsed in jiffies and "busy time" obtained from CPU >>> + * statistics. If a CPU is 100% busy, the time elapsed and busy >>> + * time should grow with the same amount in two consecutive >>> + * samples, but in practice there could be a tiny difference, >>> + * making the accumulated idle time decrease sometimes. Hence, >>> + * in this case, idle_time should be regarded as 0 in order to >>> + * make the further process correct. >>> + */ >>> + if (cur_idle_time > j_cdbs->prev_cpu_idle) >>> + idle_time = cur_idle_time - j_cdbs->prev_cpu_idle; >>> + else >>> + idle_time = 0; >>> + >>> j_cdbs->prev_cpu_idle = cur_idle_time; >>> >>> if (ignore_nice) { >>> @@ -162,7 +178,7 @@ unsigned int dbs_update(struct cpufreq_policy *policy) >>> * calls, so the previous load value can be used then. >>> */ >>> load = j_cdbs->prev_load; >>> - } else if (unlikely((int)idle_time > 2 * sampling_rate && >>> + } else if (unlikely(idle_time > 2 * sampling_rate && >>> j_cdbs->prev_load)) { >>> /* >>> * If the CPU had gone completely idle and a task has >>> @@ -189,30 +205,15 @@ unsigned int dbs_update(struct cpufreq_policy *policy) >>> load = j_cdbs->prev_load; >>> j_cdbs->prev_load = 0; >>> } else { >>> - if (time_elapsed >= idle_time) { >>> + if (time_elapsed > idle_time) >>> load = 100 * (time_elapsed - idle_time) / time_elapsed; >>> - } else { >>> - /* >>> - * That can happen if idle_time is returned by >>> - * get_cpu_idle_time_jiffy(). In that case >>> - * idle_time is roughly equal to the difference >>> - * between time_elapsed and "busy time" obtained >>> - * from CPU statistics. Then, the "busy time" >>> - * can end up being greater than time_elapsed >>> - * (for example, if jiffies_64 and the CPU >>> - * statistics are updated by different CPUs), >>> - * so idle_time may in fact be negative. That >>> - * means, though, that the CPU was busy all >>> - * the time (on the rough average) during the >>> - * last sampling interval and 100 can be >>> - * returned as the load. >>> - */ >>> - load = (int)idle_time < 0 ? 100 : 0; >>> - } >>> + else >>> + load = 0; >>> + >>> j_cdbs->prev_load = load; >>> } >>> >>> - if (unlikely((int)idle_time > 2 * sampling_rate)) { >>> + if (unlikely(idle_time > 2 * sampling_rate)) { >>> unsigned int periods = idle_time / sampling_rate; >>> >>> if (periods < idle_periods)
diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c index af44ee6a6430..1a7fcaf39cc9 100644 --- a/drivers/cpufreq/cpufreq_governor.c +++ b/drivers/cpufreq/cpufreq_governor.c @@ -145,7 +145,23 @@ unsigned int dbs_update(struct cpufreq_policy *policy) time_elapsed = update_time - j_cdbs->prev_update_time; j_cdbs->prev_update_time = update_time; - idle_time = cur_idle_time - j_cdbs->prev_cpu_idle; + /* + * cur_idle_time could be smaller than j_cdbs->prev_cpu_idle if + * it's obtained from get_cpu_idle_time_jiffy() when NOHZ is + * off, where idle_time is calculated by the difference between + * time elapsed in jiffies and "busy time" obtained from CPU + * statistics. If a CPU is 100% busy, the time elapsed and busy + * time should grow with the same amount in two consecutive + * samples, but in practice there could be a tiny difference, + * making the accumulated idle time decrease sometimes. Hence, + * in this case, idle_time should be regarded as 0 in order to + * make the further process correct. + */ + if (cur_idle_time > j_cdbs->prev_cpu_idle) + idle_time = cur_idle_time - j_cdbs->prev_cpu_idle; + else + idle_time = 0; + j_cdbs->prev_cpu_idle = cur_idle_time; if (ignore_nice) { @@ -162,7 +178,7 @@ unsigned int dbs_update(struct cpufreq_policy *policy) * calls, so the previous load value can be used then. */ load = j_cdbs->prev_load; - } else if (unlikely((int)idle_time > 2 * sampling_rate && + } else if (unlikely(idle_time > 2 * sampling_rate && j_cdbs->prev_load)) { /* * If the CPU had gone completely idle and a task has @@ -189,30 +205,15 @@ unsigned int dbs_update(struct cpufreq_policy *policy) load = j_cdbs->prev_load; j_cdbs->prev_load = 0; } else { - if (time_elapsed >= idle_time) { + if (time_elapsed > idle_time) load = 100 * (time_elapsed - idle_time) / time_elapsed; - } else { - /* - * That can happen if idle_time is returned by - * get_cpu_idle_time_jiffy(). In that case - * idle_time is roughly equal to the difference - * between time_elapsed and "busy time" obtained - * from CPU statistics. Then, the "busy time" - * can end up being greater than time_elapsed - * (for example, if jiffies_64 and the CPU - * statistics are updated by different CPUs), - * so idle_time may in fact be negative. That - * means, though, that the CPU was busy all - * the time (on the rough average) during the - * last sampling interval and 100 can be - * returned as the load. - */ - load = (int)idle_time < 0 ? 100 : 0; - } + else + load = 0; + j_cdbs->prev_load = load; } - if (unlikely((int)idle_time > 2 * sampling_rate)) { + if (unlikely(idle_time > 2 * sampling_rate)) { unsigned int periods = idle_time / sampling_rate; if (periods < idle_periods)