diff mbox series

[v3] cpufreq: governor: Fix negative 'idle_time' handling in dbs_update()

Message ID 20250213035510.2402076-1-zhanjie9@hisilicon.com (mailing list archive)
State New
Headers show
Series [v3] cpufreq: governor: Fix negative 'idle_time' handling in dbs_update() | expand

Commit Message

Jie Zhan Feb. 13, 2025, 3:55 a.m. UTC
We observed an issue that the cpu frequency can't raise up with a 100% cpu
load when NOHZ is off and the 'conservative' governor is selected.

'idle_time' can be negative if it's obtained from get_cpu_idle_time_jiffy()
when NOHZ is off.  This was found and explained in commit 9485e4ca0b48
("cpufreq: governor: Fix handling of special cases in dbs_update()").

However, commit 7592019634f8 ("cpufreq: governors: Fix long idle detection
logic in load calculation") introduced a comparison between 'idle_time' and
'samling_rate' to detect a long idle interval.  While 'idle_time' is
converted to int before comparison, it's actually promoted to unsigned
again when compared with an unsigned 'sampling_rate'.  Hence, this leads to
wrong idle interval detection when it's in fact 100% busy and sets
policy_dbs->idle_periods to a very large value.  'conservative' adjusts the
frequency to minimum because of the large 'idle_periods', such that the
frequency can't raise up.  'Ondemand' doesn't use policy_dbs->idle_periods
so it fortunately avoids the issue.

Correct negative 'idle_time' to 0 before any use of it in dbs_update().

Fixes: 7592019634f8 ("cpufreq: governors: Fix long idle detection logic in load calculation")
Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
---
v3:
- Remove ternary operators.

v2:
- Avoid type conversion, compare current and previous idle time before
  obtaining 'idle_time'.
- Update the explanation in comments.

Discussions:
v2: https://lore.kernel.org/linux-pm/20250212081438.1294503-1-zhanjie9@hisilicon.com/
v1: https://lore.kernel.org/linux-pm/20250210130659.3533182-1-zhanjie9@hisilicon.com/
---
 drivers/cpufreq/cpufreq_governor.c | 45 +++++++++++++++---------------
 1 file changed, 23 insertions(+), 22 deletions(-)

Comments

Jie Zhan Feb. 18, 2025, 12:41 p.m. UTC | #1
A kindly ping on this patch.

Thanks,
Jie

On 13/02/2025 11:55, Jie Zhan wrote:
> We observed an issue that the cpu frequency can't raise up with a 100% cpu
> load when NOHZ is off and the 'conservative' governor is selected.
> 
> 'idle_time' can be negative if it's obtained from get_cpu_idle_time_jiffy()
> when NOHZ is off.  This was found and explained in commit 9485e4ca0b48
> ("cpufreq: governor: Fix handling of special cases in dbs_update()").
> 
> However, commit 7592019634f8 ("cpufreq: governors: Fix long idle detection
> logic in load calculation") introduced a comparison between 'idle_time' and
> 'samling_rate' to detect a long idle interval.  While 'idle_time' is
> converted to int before comparison, it's actually promoted to unsigned
> again when compared with an unsigned 'sampling_rate'.  Hence, this leads to
> wrong idle interval detection when it's in fact 100% busy and sets
> policy_dbs->idle_periods to a very large value.  'conservative' adjusts the
> frequency to minimum because of the large 'idle_periods', such that the
> frequency can't raise up.  'Ondemand' doesn't use policy_dbs->idle_periods
> so it fortunately avoids the issue.
> 
> Correct negative 'idle_time' to 0 before any use of it in dbs_update().
> 
> Fixes: 7592019634f8 ("cpufreq: governors: Fix long idle detection logic in load calculation")
> Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>
> Reviewed-by: Chen Yu <yu.c.chen@intel.com>
> ---
> v3:
> - Remove ternary operators.
> 
> v2:
> - Avoid type conversion, compare current and previous idle time before
>   obtaining 'idle_time'.
> - Update the explanation in comments.
> 
> Discussions:
> v2: https://lore.kernel.org/linux-pm/20250212081438.1294503-1-zhanjie9@hisilicon.com/
> v1: https://lore.kernel.org/linux-pm/20250210130659.3533182-1-zhanjie9@hisilicon.com/
> ---
>  drivers/cpufreq/cpufreq_governor.c | 45 +++++++++++++++---------------
>  1 file changed, 23 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
> index af44ee6a6430..1a7fcaf39cc9 100644
> --- a/drivers/cpufreq/cpufreq_governor.c
> +++ b/drivers/cpufreq/cpufreq_governor.c
> @@ -145,7 +145,23 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
>  		time_elapsed = update_time - j_cdbs->prev_update_time;
>  		j_cdbs->prev_update_time = update_time;
>  
> -		idle_time = cur_idle_time - j_cdbs->prev_cpu_idle;
> +		/*
> +		 * cur_idle_time could be smaller than j_cdbs->prev_cpu_idle if
> +		 * it's obtained from get_cpu_idle_time_jiffy() when NOHZ is
> +		 * off, where idle_time is calculated by the difference between
> +		 * time elapsed in jiffies and "busy time" obtained from CPU
> +		 * statistics.  If a CPU is 100% busy, the time elapsed and busy
> +		 * time should grow with the same amount in two consecutive
> +		 * samples, but in practice there could be a tiny difference,
> +		 * making the accumulated idle time decrease sometimes.  Hence,
> +		 * in this case, idle_time should be regarded as 0 in order to
> +		 * make the further process correct.
> +		 */
> +		if (cur_idle_time > j_cdbs->prev_cpu_idle)
> +			idle_time = cur_idle_time - j_cdbs->prev_cpu_idle;
> +		else
> +			idle_time = 0;
> +
>  		j_cdbs->prev_cpu_idle = cur_idle_time;
>  
>  		if (ignore_nice) {
> @@ -162,7 +178,7 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
>  			 * calls, so the previous load value can be used then.
>  			 */
>  			load = j_cdbs->prev_load;
> -		} else if (unlikely((int)idle_time > 2 * sampling_rate &&
> +		} else if (unlikely(idle_time > 2 * sampling_rate &&
>  				    j_cdbs->prev_load)) {
>  			/*
>  			 * If the CPU had gone completely idle and a task has
> @@ -189,30 +205,15 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
>  			load = j_cdbs->prev_load;
>  			j_cdbs->prev_load = 0;
>  		} else {
> -			if (time_elapsed >= idle_time) {
> +			if (time_elapsed > idle_time)
>  				load = 100 * (time_elapsed - idle_time) / time_elapsed;
> -			} else {
> -				/*
> -				 * That can happen if idle_time is returned by
> -				 * get_cpu_idle_time_jiffy().  In that case
> -				 * idle_time is roughly equal to the difference
> -				 * between time_elapsed and "busy time" obtained
> -				 * from CPU statistics.  Then, the "busy time"
> -				 * can end up being greater than time_elapsed
> -				 * (for example, if jiffies_64 and the CPU
> -				 * statistics are updated by different CPUs),
> -				 * so idle_time may in fact be negative.  That
> -				 * means, though, that the CPU was busy all
> -				 * the time (on the rough average) during the
> -				 * last sampling interval and 100 can be
> -				 * returned as the load.
> -				 */
> -				load = (int)idle_time < 0 ? 100 : 0;
> -			}
> +			else
> +				load = 0;
> +
>  			j_cdbs->prev_load = load;
>  		}
>  
> -		if (unlikely((int)idle_time > 2 * sampling_rate)) {
> +		if (unlikely(idle_time > 2 * sampling_rate)) {
>  			unsigned int periods = idle_time / sampling_rate;
>  
>  			if (periods < idle_periods)
Rafael J. Wysocki Feb. 18, 2025, 12:43 p.m. UTC | #2
On Tue, Feb 18, 2025 at 1:41 PM Jie Zhan <zhanjie9@hisilicon.com> wrote:
>
> A kindly ping on this patch.

I'll get to it in the next few days, I don't think it is urgent.

Thanks!

> On 13/02/2025 11:55, Jie Zhan wrote:
> > We observed an issue that the cpu frequency can't raise up with a 100% cpu
> > load when NOHZ is off and the 'conservative' governor is selected.
> >
> > 'idle_time' can be negative if it's obtained from get_cpu_idle_time_jiffy()
> > when NOHZ is off.  This was found and explained in commit 9485e4ca0b48
> > ("cpufreq: governor: Fix handling of special cases in dbs_update()").
> >
> > However, commit 7592019634f8 ("cpufreq: governors: Fix long idle detection
> > logic in load calculation") introduced a comparison between 'idle_time' and
> > 'samling_rate' to detect a long idle interval.  While 'idle_time' is
> > converted to int before comparison, it's actually promoted to unsigned
> > again when compared with an unsigned 'sampling_rate'.  Hence, this leads to
> > wrong idle interval detection when it's in fact 100% busy and sets
> > policy_dbs->idle_periods to a very large value.  'conservative' adjusts the
> > frequency to minimum because of the large 'idle_periods', such that the
> > frequency can't raise up.  'Ondemand' doesn't use policy_dbs->idle_periods
> > so it fortunately avoids the issue.
> >
> > Correct negative 'idle_time' to 0 before any use of it in dbs_update().
> >
> > Fixes: 7592019634f8 ("cpufreq: governors: Fix long idle detection logic in load calculation")
> > Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>
> > Reviewed-by: Chen Yu <yu.c.chen@intel.com>
> > ---
> > v3:
> > - Remove ternary operators.
> >
> > v2:
> > - Avoid type conversion, compare current and previous idle time before
> >   obtaining 'idle_time'.
> > - Update the explanation in comments.
> >
> > Discussions:
> > v2: https://lore.kernel.org/linux-pm/20250212081438.1294503-1-zhanjie9@hisilicon.com/
> > v1: https://lore.kernel.org/linux-pm/20250210130659.3533182-1-zhanjie9@hisilicon.com/
> > ---
> >  drivers/cpufreq/cpufreq_governor.c | 45 +++++++++++++++---------------
> >  1 file changed, 23 insertions(+), 22 deletions(-)
> >
> > diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
> > index af44ee6a6430..1a7fcaf39cc9 100644
> > --- a/drivers/cpufreq/cpufreq_governor.c
> > +++ b/drivers/cpufreq/cpufreq_governor.c
> > @@ -145,7 +145,23 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
> >               time_elapsed = update_time - j_cdbs->prev_update_time;
> >               j_cdbs->prev_update_time = update_time;
> >
> > -             idle_time = cur_idle_time - j_cdbs->prev_cpu_idle;
> > +             /*
> > +              * cur_idle_time could be smaller than j_cdbs->prev_cpu_idle if
> > +              * it's obtained from get_cpu_idle_time_jiffy() when NOHZ is
> > +              * off, where idle_time is calculated by the difference between
> > +              * time elapsed in jiffies and "busy time" obtained from CPU
> > +              * statistics.  If a CPU is 100% busy, the time elapsed and busy
> > +              * time should grow with the same amount in two consecutive
> > +              * samples, but in practice there could be a tiny difference,
> > +              * making the accumulated idle time decrease sometimes.  Hence,
> > +              * in this case, idle_time should be regarded as 0 in order to
> > +              * make the further process correct.
> > +              */
> > +             if (cur_idle_time > j_cdbs->prev_cpu_idle)
> > +                     idle_time = cur_idle_time - j_cdbs->prev_cpu_idle;
> > +             else
> > +                     idle_time = 0;
> > +
> >               j_cdbs->prev_cpu_idle = cur_idle_time;
> >
> >               if (ignore_nice) {
> > @@ -162,7 +178,7 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
> >                        * calls, so the previous load value can be used then.
> >                        */
> >                       load = j_cdbs->prev_load;
> > -             } else if (unlikely((int)idle_time > 2 * sampling_rate &&
> > +             } else if (unlikely(idle_time > 2 * sampling_rate &&
> >                                   j_cdbs->prev_load)) {
> >                       /*
> >                        * If the CPU had gone completely idle and a task has
> > @@ -189,30 +205,15 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
> >                       load = j_cdbs->prev_load;
> >                       j_cdbs->prev_load = 0;
> >               } else {
> > -                     if (time_elapsed >= idle_time) {
> > +                     if (time_elapsed > idle_time)
> >                               load = 100 * (time_elapsed - idle_time) / time_elapsed;
> > -                     } else {
> > -                             /*
> > -                              * That can happen if idle_time is returned by
> > -                              * get_cpu_idle_time_jiffy().  In that case
> > -                              * idle_time is roughly equal to the difference
> > -                              * between time_elapsed and "busy time" obtained
> > -                              * from CPU statistics.  Then, the "busy time"
> > -                              * can end up being greater than time_elapsed
> > -                              * (for example, if jiffies_64 and the CPU
> > -                              * statistics are updated by different CPUs),
> > -                              * so idle_time may in fact be negative.  That
> > -                              * means, though, that the CPU was busy all
> > -                              * the time (on the rough average) during the
> > -                              * last sampling interval and 100 can be
> > -                              * returned as the load.
> > -                              */
> > -                             load = (int)idle_time < 0 ? 100 : 0;
> > -                     }
> > +                     else
> > +                             load = 0;
> > +
> >                       j_cdbs->prev_load = load;
> >               }
> >
> > -             if (unlikely((int)idle_time > 2 * sampling_rate)) {
> > +             if (unlikely(idle_time > 2 * sampling_rate)) {
> >                       unsigned int periods = idle_time / sampling_rate;
> >
> >                       if (periods < idle_periods)
Jie Zhan Feb. 19, 2025, 2:42 a.m. UTC | #3
Yeah, no problem!  Just in case it's missed.

Thanks.

On 18/02/2025 20:43, Rafael J. Wysocki wrote:
> On Tue, Feb 18, 2025 at 1:41 PM Jie Zhan <zhanjie9@hisilicon.com> wrote:
>>
>> A kindly ping on this patch.
> 
> I'll get to it in the next few days, I don't think it is urgent.
> 
> Thanks!
> 
>> On 13/02/2025 11:55, Jie Zhan wrote:
>>> We observed an issue that the cpu frequency can't raise up with a 100% cpu
>>> load when NOHZ is off and the 'conservative' governor is selected.
>>>
>>> 'idle_time' can be negative if it's obtained from get_cpu_idle_time_jiffy()
>>> when NOHZ is off.  This was found and explained in commit 9485e4ca0b48
>>> ("cpufreq: governor: Fix handling of special cases in dbs_update()").
>>>
>>> However, commit 7592019634f8 ("cpufreq: governors: Fix long idle detection
>>> logic in load calculation") introduced a comparison between 'idle_time' and
>>> 'samling_rate' to detect a long idle interval.  While 'idle_time' is
>>> converted to int before comparison, it's actually promoted to unsigned
>>> again when compared with an unsigned 'sampling_rate'.  Hence, this leads to
>>> wrong idle interval detection when it's in fact 100% busy and sets
>>> policy_dbs->idle_periods to a very large value.  'conservative' adjusts the
>>> frequency to minimum because of the large 'idle_periods', such that the
>>> frequency can't raise up.  'Ondemand' doesn't use policy_dbs->idle_periods
>>> so it fortunately avoids the issue.
>>>
>>> Correct negative 'idle_time' to 0 before any use of it in dbs_update().
>>>
>>> Fixes: 7592019634f8 ("cpufreq: governors: Fix long idle detection logic in load calculation")
>>> Signed-off-by: Jie Zhan <zhanjie9@hisilicon.com>
>>> Reviewed-by: Chen Yu <yu.c.chen@intel.com>
>>> ---
>>> v3:
>>> - Remove ternary operators.
>>>
>>> v2:
>>> - Avoid type conversion, compare current and previous idle time before
>>>   obtaining 'idle_time'.
>>> - Update the explanation in comments.
>>>
>>> Discussions:
>>> v2: https://lore.kernel.org/linux-pm/20250212081438.1294503-1-zhanjie9@hisilicon.com/
>>> v1: https://lore.kernel.org/linux-pm/20250210130659.3533182-1-zhanjie9@hisilicon.com/
>>> ---
>>>  drivers/cpufreq/cpufreq_governor.c | 45 +++++++++++++++---------------
>>>  1 file changed, 23 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
>>> index af44ee6a6430..1a7fcaf39cc9 100644
>>> --- a/drivers/cpufreq/cpufreq_governor.c
>>> +++ b/drivers/cpufreq/cpufreq_governor.c
>>> @@ -145,7 +145,23 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
>>>               time_elapsed = update_time - j_cdbs->prev_update_time;
>>>               j_cdbs->prev_update_time = update_time;
>>>
>>> -             idle_time = cur_idle_time - j_cdbs->prev_cpu_idle;
>>> +             /*
>>> +              * cur_idle_time could be smaller than j_cdbs->prev_cpu_idle if
>>> +              * it's obtained from get_cpu_idle_time_jiffy() when NOHZ is
>>> +              * off, where idle_time is calculated by the difference between
>>> +              * time elapsed in jiffies and "busy time" obtained from CPU
>>> +              * statistics.  If a CPU is 100% busy, the time elapsed and busy
>>> +              * time should grow with the same amount in two consecutive
>>> +              * samples, but in practice there could be a tiny difference,
>>> +              * making the accumulated idle time decrease sometimes.  Hence,
>>> +              * in this case, idle_time should be regarded as 0 in order to
>>> +              * make the further process correct.
>>> +              */
>>> +             if (cur_idle_time > j_cdbs->prev_cpu_idle)
>>> +                     idle_time = cur_idle_time - j_cdbs->prev_cpu_idle;
>>> +             else
>>> +                     idle_time = 0;
>>> +
>>>               j_cdbs->prev_cpu_idle = cur_idle_time;
>>>
>>>               if (ignore_nice) {
>>> @@ -162,7 +178,7 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
>>>                        * calls, so the previous load value can be used then.
>>>                        */
>>>                       load = j_cdbs->prev_load;
>>> -             } else if (unlikely((int)idle_time > 2 * sampling_rate &&
>>> +             } else if (unlikely(idle_time > 2 * sampling_rate &&
>>>                                   j_cdbs->prev_load)) {
>>>                       /*
>>>                        * If the CPU had gone completely idle and a task has
>>> @@ -189,30 +205,15 @@ unsigned int dbs_update(struct cpufreq_policy *policy)
>>>                       load = j_cdbs->prev_load;
>>>                       j_cdbs->prev_load = 0;
>>>               } else {
>>> -                     if (time_elapsed >= idle_time) {
>>> +                     if (time_elapsed > idle_time)
>>>                               load = 100 * (time_elapsed - idle_time) / time_elapsed;
>>> -                     } else {
>>> -                             /*
>>> -                              * That can happen if idle_time is returned by
>>> -                              * get_cpu_idle_time_jiffy().  In that case
>>> -                              * idle_time is roughly equal to the difference
>>> -                              * between time_elapsed and "busy time" obtained
>>> -                              * from CPU statistics.  Then, the "busy time"
>>> -                              * can end up being greater than time_elapsed
>>> -                              * (for example, if jiffies_64 and the CPU
>>> -                              * statistics are updated by different CPUs),
>>> -                              * so idle_time may in fact be negative.  That
>>> -                              * means, though, that the CPU was busy all
>>> -                              * the time (on the rough average) during the
>>> -                              * last sampling interval and 100 can be
>>> -                              * returned as the load.
>>> -                              */
>>> -                             load = (int)idle_time < 0 ? 100 : 0;
>>> -                     }
>>> +                     else
>>> +                             load = 0;
>>> +
>>>                       j_cdbs->prev_load = load;
>>>               }
>>>
>>> -             if (unlikely((int)idle_time > 2 * sampling_rate)) {
>>> +             if (unlikely(idle_time > 2 * sampling_rate)) {
>>>                       unsigned int periods = idle_time / sampling_rate;
>>>
>>>                       if (periods < idle_periods)
diff mbox series

Patch

diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index af44ee6a6430..1a7fcaf39cc9 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -145,7 +145,23 @@  unsigned int dbs_update(struct cpufreq_policy *policy)
 		time_elapsed = update_time - j_cdbs->prev_update_time;
 		j_cdbs->prev_update_time = update_time;
 
-		idle_time = cur_idle_time - j_cdbs->prev_cpu_idle;
+		/*
+		 * cur_idle_time could be smaller than j_cdbs->prev_cpu_idle if
+		 * it's obtained from get_cpu_idle_time_jiffy() when NOHZ is
+		 * off, where idle_time is calculated by the difference between
+		 * time elapsed in jiffies and "busy time" obtained from CPU
+		 * statistics.  If a CPU is 100% busy, the time elapsed and busy
+		 * time should grow with the same amount in two consecutive
+		 * samples, but in practice there could be a tiny difference,
+		 * making the accumulated idle time decrease sometimes.  Hence,
+		 * in this case, idle_time should be regarded as 0 in order to
+		 * make the further process correct.
+		 */
+		if (cur_idle_time > j_cdbs->prev_cpu_idle)
+			idle_time = cur_idle_time - j_cdbs->prev_cpu_idle;
+		else
+			idle_time = 0;
+
 		j_cdbs->prev_cpu_idle = cur_idle_time;
 
 		if (ignore_nice) {
@@ -162,7 +178,7 @@  unsigned int dbs_update(struct cpufreq_policy *policy)
 			 * calls, so the previous load value can be used then.
 			 */
 			load = j_cdbs->prev_load;
-		} else if (unlikely((int)idle_time > 2 * sampling_rate &&
+		} else if (unlikely(idle_time > 2 * sampling_rate &&
 				    j_cdbs->prev_load)) {
 			/*
 			 * If the CPU had gone completely idle and a task has
@@ -189,30 +205,15 @@  unsigned int dbs_update(struct cpufreq_policy *policy)
 			load = j_cdbs->prev_load;
 			j_cdbs->prev_load = 0;
 		} else {
-			if (time_elapsed >= idle_time) {
+			if (time_elapsed > idle_time)
 				load = 100 * (time_elapsed - idle_time) / time_elapsed;
-			} else {
-				/*
-				 * That can happen if idle_time is returned by
-				 * get_cpu_idle_time_jiffy().  In that case
-				 * idle_time is roughly equal to the difference
-				 * between time_elapsed and "busy time" obtained
-				 * from CPU statistics.  Then, the "busy time"
-				 * can end up being greater than time_elapsed
-				 * (for example, if jiffies_64 and the CPU
-				 * statistics are updated by different CPUs),
-				 * so idle_time may in fact be negative.  That
-				 * means, though, that the CPU was busy all
-				 * the time (on the rough average) during the
-				 * last sampling interval and 100 can be
-				 * returned as the load.
-				 */
-				load = (int)idle_time < 0 ? 100 : 0;
-			}
+			else
+				load = 0;
+
 			j_cdbs->prev_load = load;
 		}
 
-		if (unlikely((int)idle_time > 2 * sampling_rate)) {
+		if (unlikely(idle_time > 2 * sampling_rate)) {
 			unsigned int periods = idle_time / sampling_rate;
 
 			if (periods < idle_periods)