diff mbox

cpuidle/powernv: Fix snooze timeout

Message ID 1466624203-1847-1-git-send-email-shreyas@linux.vnet.ibm.com (mailing list archive)
State Superseded, archived
Headers show

Commit Message

Shreyas B. Prabhu June 22, 2016, 7:36 p.m. UTC
Snooze is a poll idle state in powernv and pseries platforms. Snooze
has a timeout so that if a cpu stays in snooze for more than target
residency of the next available idle state, then it would exit thereby
giving chance to the cpuidle governor to re-evaluate and
promote the cpu to a deeper idle state. Therefore whenever snooze exits
due to this timeout, its last_residency will be target_residency of next
deeper state.

commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")
changed the math around last_residency calculation. Specifically, while
converting last_residency value from nanoseconds to microseconds it does
right shift by 10. Due to this, in snooze timeout exit scenarios
last_residency calculated is roughly 2.3% less than target_residency of
next available state. This pattern is picked up get_typical_interval()
in the menu governor and therefore expected_interval in menu_select() is
frequently less than the target_residency of any state but snooze.

Due to this we are entering snooze at a higher rate, thereby affecting
the single thread performance.
Since the math around last_residency is not meant to be precise, fix this
issue setting snooze timeout to 105% of target_residency of next
available idle state.

This also adds comment around why snooze timeout is necessary.

Reported-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
---
 drivers/cpuidle/cpuidle-powernv.c | 14 ++++++++++++++
 drivers/cpuidle/cpuidle-pseries.c | 13 +++++++++++++
 2 files changed, 27 insertions(+)

Comments

Rafael J. Wysocki June 22, 2016, 10:49 p.m. UTC | #1
On Wed, Jun 22, 2016 at 9:36 PM, Shreyas B. Prabhu
<shreyas@linux.vnet.ibm.com> wrote:
> Snooze is a poll idle state in powernv and pseries platforms. Snooze
> has a timeout so that if a cpu stays in snooze for more than target
> residency of the next available idle state, then it would exit thereby
> giving chance to the cpuidle governor to re-evaluate and
> promote the cpu to a deeper idle state. Therefore whenever snooze exits
> due to this timeout, its last_residency will be target_residency of next
> deeper state.
>
> commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")
> changed the math around last_residency calculation. Specifically, while
> converting last_residency value from nanoseconds to microseconds it does
> right shift by 10. Due to this, in snooze timeout exit scenarios
> last_residency calculated is roughly 2.3% less than target_residency of
> next available state. This pattern is picked up get_typical_interval()
> in the menu governor and therefore expected_interval in menu_select() is
> frequently less than the target_residency of any state but snooze.
>
> Due to this we are entering snooze at a higher rate, thereby affecting
> the single thread performance.
> Since the math around last_residency is not meant to be precise, fix this
> issue setting snooze timeout to 105% of target_residency of next
> available idle state.
>
> This also adds comment around why snooze timeout is necessary.

Daniel, any comments?

> Reported-by: Anton Blanchard <anton@samba.org>
> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
> ---
>  drivers/cpuidle/cpuidle-powernv.c | 14 ++++++++++++++
>  drivers/cpuidle/cpuidle-pseries.c | 13 +++++++++++++
>  2 files changed, 27 insertions(+)
>
> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
> index e12dc30..5835491 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -268,10 +268,24 @@ static int powernv_idle_probe(void)
>                 cpuidle_state_table = powernv_states;
>                 /* Device tree can indicate more idle states */
>                 max_idle_state = powernv_add_idle_states();
> +
> +               /*
> +                * Staying in snooze for a long period can degrade the
> +                * perfomance of the sibling cpus. Set timeout for snooze such
> +                * that if the cpu stays in snooze longer than target residency
> +                * of the next available idle state then exit from snooze. This
> +                * gives a chance to the cpuidle governor to re-evaluate and
> +                * promote it to deeper idle states.
> +                */
>                 if (max_idle_state > 1) {
>                         snooze_timeout_en = true;
>                         snooze_timeout = powernv_states[1].target_residency *
>                                          tb_ticks_per_usec;
> +                       /*
> +                        * Give a 5% margin since target residency related math
> +                        * is not precise in cpuidle core.
> +                        */
> +                       snooze_timeout += snooze_timeout / 20;
>                 }
>         } else
>                 return -ENODEV;
> diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
> index 07135e0..22de841 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -250,10 +250,23 @@ static int pseries_idle_probe(void)
>         } else
>                 return -ENODEV;
>
> +       /*
> +        * Staying in snooze for a long period can degrade the
> +        * perfomance of the sibling cpus. Set timeout for snooze such
> +        * that if the cpu stays in snooze longer than target residency
> +        * of the next available idle state then exit from snooze. This
> +        * gives a chance to the cpuidle governor to re-evaluate and
> +        * promote it to deeper idle states.
> +        */
>         if (max_idle_state > 1) {
>                 snooze_timeout_en = true;
>                 snooze_timeout = cpuidle_state_table[1].target_residency *
>                                  tb_ticks_per_usec;
> +               /*
> +                * Give a 5% margin since target residency related math
> +                * is not precise in cpuidle core.
> +                */
> +               snooze_timeout += snooze_timeout / 20;
>         }
>         return 0;
>  }
> --
> 2.1.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Shreyas B. Prabhu June 23, 2016, 4:58 a.m. UTC | #2
On 06/23/2016 05:18 AM, Balbir Singh wrote:
> 
> 
> On 23/06/16 05:36, Shreyas B. Prabhu wrote:
>> Snooze is a poll idle state in powernv and pseries platforms. Snooze
>> has a timeout so that if a cpu stays in snooze for more than target
>> residency of the next available idle state, then it would exit thereby
>> giving chance to the cpuidle governor to re-evaluate and
>> promote the cpu to a deeper idle state. Therefore whenever snooze exits
>> due to this timeout, its last_residency will be target_residency of next
>> deeper state.
>>
>> commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")
>> changed the math around last_residency calculation. Specifically, while
>> converting last_residency value from nanoseconds to microseconds it does
>> right shift by 10. Due to this, in snooze timeout exit scenarios
>> last_residency calculated is roughly 2.3% less than target_residency of
>> next available state. This pattern is picked up get_typical_interval()
>> in the menu governor and therefore expected_interval in menu_select() is
>> frequently less than the target_residency of any state but snooze.
>>
>> Due to this we are entering snooze at a higher rate, thereby affecting
>> the single thread performance.
>> Since the math around last_residency is not meant to be precise, fix this
>> issue setting snooze timeout to 105% of target_residency of next
>> available idle state.
>>
>> This also adds comment around why snooze timeout is necessary.
>>
>> Reported-by: Anton Blanchard <anton@samba.org>
>> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
>> ---
>>  drivers/cpuidle/cpuidle-powernv.c | 14 ++++++++++++++
>>  drivers/cpuidle/cpuidle-pseries.c | 13 +++++++++++++
>>  2 files changed, 27 insertions(+)
>>
>> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
>> index e12dc30..5835491 100644
>> --- a/drivers/cpuidle/cpuidle-powernv.c
>> +++ b/drivers/cpuidle/cpuidle-powernv.c
>> @@ -268,10 +268,24 @@ static int powernv_idle_probe(void)
>>  		cpuidle_state_table = powernv_states;
>>  		/* Device tree can indicate more idle states */
>>  		max_idle_state = powernv_add_idle_states();
>> +
>> +		/*
>> +		 * Staying in snooze for a long period can degrade the
>> +		 * perfomance of the sibling cpus. Set timeout for snooze such
>> +		 * that if the cpu stays in snooze longer than target residency
>> +		 * of the next available idle state then exit from snooze. This
>> +		 * gives a chance to the cpuidle governor to re-evaluate and
>> +		 * promote it to deeper idle states.
>> +		 */
>>  		if (max_idle_state > 1) {
>>  			snooze_timeout_en = true;
>>  			snooze_timeout = powernv_states[1].target_residency *
>>  					 tb_ticks_per_usec;
>> +			/*
>> +			 * Give a 5% margin since target residency related math
>> +			 * is not precise in cpuidle core.
>> +			 */
> 
> Is this due to the microsecond conversion mentioned above? It would be nice to
> have it in the comment. Does
> 
> (powernv_states[1].target_residency + tb_ticks_per_usec) / tb_ticks_per_usec solve
> your rounding issues, assuming the issue is really rounding or maybe it is due
> to the shift by 10, could you please elaborate on what related math is not
> precise? That would explain to me why I missed understanding your changes.
> 
>> +			snooze_timeout += snooze_timeout / 20;
> 
> For now 5% is sufficient, but do you want to check to assert to check if
> 
> snooze_timeout (in microseconds) / tb_ticks_per_usec > powernv_states[i].target_residency?
> 

This is not a rounding issue. As I mentioned in the commit message, this
is because of the last_residency calculation in cpuidle.c.
To elaborate, last residency calculation is done in the following way
after commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with
local_clock()") -

cpuidle_enter_state()
{
	[...]
	time_start = local_clock();
	[enter idle state]
	time_end = local_clock();
	/*
         * local_clock() returns the time in nanosecond, let's shift
         * by 10 (divide by 1024) to have microsecond based time.
         */
        diff = (time_end - time_start) >> 10;
	[...]
	dev->last_residency = (int) diff;
}

Because of >>10 as opposed to /1000, last_residency is lesser by 2.3%

In snooze timeout exit scenarios because of this, last_residency
calculated is 2.3% less than target_residency of next available state.
This affects get_typical_interval() in the menu governor and therefore
expected_interval in menu_select() is frequently less than the
target_residency of any state but snooze.


I'll expand the comments as you suggested.

Thanks,
Shreyas

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Education Directorate June 23, 2016, 9:28 a.m. UTC | #3
On 23/06/16 14:58, Shreyas B Prabhu wrote:
> 
> 
> On 06/23/2016 05:18 AM, Balbir Singh wrote:
>>
>>
>> On 23/06/16 05:36, Shreyas B. Prabhu wrote:
>>> Snooze is a poll idle state in powernv and pseries platforms. Snooze
>>> has a timeout so that if a cpu stays in snooze for more than target
>>> residency of the next available idle state, then it would exit thereby
>>> giving chance to the cpuidle governor to re-evaluate and
>>> promote the cpu to a deeper idle state. Therefore whenever snooze exits
>>> due to this timeout, its last_residency will be target_residency of next
>>> deeper state.
>>>
>>> commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")
>>> changed the math around last_residency calculation. Specifically, while
>>> converting last_residency value from nanoseconds to microseconds it does
>>> right shift by 10. Due to this, in snooze timeout exit scenarios
>>> last_residency calculated is roughly 2.3% less than target_residency of
>>> next available state. This pattern is picked up get_typical_interval()
>>> in the menu governor and therefore expected_interval in menu_select() is
>>> frequently less than the target_residency of any state but snooze.
>>>
>>> Due to this we are entering snooze at a higher rate, thereby affecting
>>> the single thread performance.
>>> Since the math around last_residency is not meant to be precise, fix this
>>> issue setting snooze timeout to 105% of target_residency of next
>>> available idle state.
>>>
>>> This also adds comment around why snooze timeout is necessary.
>>>
>>> Reported-by: Anton Blanchard <anton@samba.org>
>>> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
>>> ---
>>>  drivers/cpuidle/cpuidle-powernv.c | 14 ++++++++++++++
>>>  drivers/cpuidle/cpuidle-pseries.c | 13 +++++++++++++
>>>  2 files changed, 27 insertions(+)
>>>
>>> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
>>> index e12dc30..5835491 100644
>>> --- a/drivers/cpuidle/cpuidle-powernv.c
>>> +++ b/drivers/cpuidle/cpuidle-powernv.c
>>> @@ -268,10 +268,24 @@ static int powernv_idle_probe(void)
>>>  		cpuidle_state_table = powernv_states;
>>>  		/* Device tree can indicate more idle states */
>>>  		max_idle_state = powernv_add_idle_states();
>>> +
>>> +		/*
>>> +		 * Staying in snooze for a long period can degrade the
>>> +		 * perfomance of the sibling cpus. Set timeout for snooze such
>>> +		 * that if the cpu stays in snooze longer than target residency
>>> +		 * of the next available idle state then exit from snooze. This
>>> +		 * gives a chance to the cpuidle governor to re-evaluate and
>>> +		 * promote it to deeper idle states.
>>> +		 */
>>>  		if (max_idle_state > 1) {
>>>  			snooze_timeout_en = true;
>>>  			snooze_timeout = powernv_states[1].target_residency *
>>>  					 tb_ticks_per_usec;
>>> +			/*
>>> +			 * Give a 5% margin since target residency related math
>>> +			 * is not precise in cpuidle core.
>>> +			 */
>>
>> Is this due to the microsecond conversion mentioned above? It would be nice to
>> have it in the comment. Does
>>
>> (powernv_states[1].target_residency + tb_ticks_per_usec) / tb_ticks_per_usec solve
>> your rounding issues, assuming the issue is really rounding or maybe it is due
>> to the shift by 10, could you please elaborate on what related math is not
>> precise? That would explain to me why I missed understanding your changes.
>>
>>> +			snooze_timeout += snooze_timeout / 20;
>>
>> For now 5% is sufficient, but do you want to check to assert to check if
>>
>> snooze_timeout (in microseconds) / tb_ticks_per_usec > powernv_states[i].target_residency?
>>
> 
> This is not a rounding issue. As I mentioned in the commit message, this
> is because of the last_residency calculation in cpuidle.c.
> To elaborate, last residency calculation is done in the following way
> after commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with
> local_clock()") -
> 
> cpuidle_enter_state()
> {
> 	[...]
> 	time_start = local_clock();
> 	[enter idle state]
> 	time_end = local_clock();
> 	/*
>          * local_clock() returns the time in nanosecond, let's shift
>          * by 10 (divide by 1024) to have microsecond based time.
>          */
>         diff = (time_end - time_start) >> 10;
> 	[...]
> 	dev->last_residency = (int) diff;
> }
> 
> Because of >>10 as opposed to /1000, last_residency is lesser by 2.3%


This is still a rounding error but at a different site. I see we saved
a division by doing a >> 10, but we added it right back by doing a /20
later in the platform code. Shouldn't the rounding affect other
platforms as well? Can't we fix it in cpuidle_enter_state(). Division
by 1000 can be optimized if required (but rather not add that complexity).
Thanks for patiently explaining this

Balbir
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Shreyas B. Prabhu June 23, 2016, 9:41 a.m. UTC | #4
On 06/23/2016 02:58 PM, Balbir Singh wrote:
> 
> 
> On 23/06/16 14:58, Shreyas B Prabhu wrote:
>>
>>
>> On 06/23/2016 05:18 AM, Balbir Singh wrote:
>>>
>>>
>>> On 23/06/16 05:36, Shreyas B. Prabhu wrote:
>>>> Snooze is a poll idle state in powernv and pseries platforms. Snooze
>>>> has a timeout so that if a cpu stays in snooze for more than target
>>>> residency of the next available idle state, then it would exit thereby
>>>> giving chance to the cpuidle governor to re-evaluate and
>>>> promote the cpu to a deeper idle state. Therefore whenever snooze exits
>>>> due to this timeout, its last_residency will be target_residency of next
>>>> deeper state.
>>>>
>>>> commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")
>>>> changed the math around last_residency calculation. Specifically, while
>>>> converting last_residency value from nanoseconds to microseconds it does
>>>> right shift by 10. Due to this, in snooze timeout exit scenarios
>>>> last_residency calculated is roughly 2.3% less than target_residency of
>>>> next available state. This pattern is picked up get_typical_interval()
>>>> in the menu governor and therefore expected_interval in menu_select() is
>>>> frequently less than the target_residency of any state but snooze.
>>>>
>>>> Due to this we are entering snooze at a higher rate, thereby affecting
>>>> the single thread performance.
>>>> Since the math around last_residency is not meant to be precise, fix this
>>>> issue setting snooze timeout to 105% of target_residency of next
>>>> available idle state.
>>>>
>>>> This also adds comment around why snooze timeout is necessary.
>>>>
>>>> Reported-by: Anton Blanchard <anton@samba.org>
>>>> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
>>>> ---
>>>>  drivers/cpuidle/cpuidle-powernv.c | 14 ++++++++++++++
>>>>  drivers/cpuidle/cpuidle-pseries.c | 13 +++++++++++++
>>>>  2 files changed, 27 insertions(+)
>>>>
>>>> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
>>>> index e12dc30..5835491 100644
>>>> --- a/drivers/cpuidle/cpuidle-powernv.c
>>>> +++ b/drivers/cpuidle/cpuidle-powernv.c
>>>> @@ -268,10 +268,24 @@ static int powernv_idle_probe(void)
>>>>  		cpuidle_state_table = powernv_states;
>>>>  		/* Device tree can indicate more idle states */
>>>>  		max_idle_state = powernv_add_idle_states();
>>>> +
>>>> +		/*
>>>> +		 * Staying in snooze for a long period can degrade the
>>>> +		 * perfomance of the sibling cpus. Set timeout for snooze such
>>>> +		 * that if the cpu stays in snooze longer than target residency
>>>> +		 * of the next available idle state then exit from snooze. This
>>>> +		 * gives a chance to the cpuidle governor to re-evaluate and
>>>> +		 * promote it to deeper idle states.
>>>> +		 */
>>>>  		if (max_idle_state > 1) {
>>>>  			snooze_timeout_en = true;
>>>>  			snooze_timeout = powernv_states[1].target_residency *
>>>>  					 tb_ticks_per_usec;
>>>> +			/*
>>>> +			 * Give a 5% margin since target residency related math
>>>> +			 * is not precise in cpuidle core.
>>>> +			 */
>>>
>>> Is this due to the microsecond conversion mentioned above? It would be nice to
>>> have it in the comment. Does
>>>
>>> (powernv_states[1].target_residency + tb_ticks_per_usec) / tb_ticks_per_usec solve
>>> your rounding issues, assuming the issue is really rounding or maybe it is due
>>> to the shift by 10, could you please elaborate on what related math is not
>>> precise? That would explain to me why I missed understanding your changes.
>>>
>>>> +			snooze_timeout += snooze_timeout / 20;
>>>
>>> For now 5% is sufficient, but do you want to check to assert to check if
>>>
>>> snooze_timeout (in microseconds) / tb_ticks_per_usec > powernv_states[i].target_residency?
>>>
>>
>> This is not a rounding issue. As I mentioned in the commit message, this
>> is because of the last_residency calculation in cpuidle.c.
>> To elaborate, last residency calculation is done in the following way
>> after commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with
>> local_clock()") -
>>
>> cpuidle_enter_state()
>> {
>> 	[...]
>> 	time_start = local_clock();
>> 	[enter idle state]
>> 	time_end = local_clock();
>> 	/*
>>          * local_clock() returns the time in nanosecond, let's shift
>>          * by 10 (divide by 1024) to have microsecond based time.
>>          */
>>         diff = (time_end - time_start) >> 10;
>> 	[...]
>> 	dev->last_residency = (int) diff;
>> }
>>
>> Because of >>10 as opposed to /1000, last_residency is lesser by 2.3%
> 
> 
> This is still a rounding error but at a different site. I see we saved
> a division by doing a >> 10, but we added it right back by doing a /20
> later in the platform code. 

While a >> 10 is done at every idle exit, div by 20 is done once during
boot, so this doesn't negate the previous optimization.

> Shouldn't the rounding affect other
> platforms as well? Can't we fix it in cpuidle_enter_state(). 

This does affect all platforms, but I'm guessing no other place relied
on the precision of last_residency calculations.
Daniel can perhaps comment on this.


> Division
> by 1000 can be optimized if required (but rather not add that complexity).
> Thanks for patiently explaining this
> 
> Balbir
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Education Directorate June 23, 2016, 9:55 a.m. UTC | #5
>> This is still a rounding error but at a different site. I see we saved
>> a division by doing a >> 10, but we added it right back by doing a /20
>> later in the platform code.
>
> While a >> 10 is done at every idle exit, div by 20 is done once during
> boot, so this doesn't negate the previous optimization.
>

Yes, fair point

>> Shouldn't the rounding affect other
>> platforms as well? Can't we fix it in cpuidle_enter_state().
>
> This does affect all platforms, but I'm guessing no other place relied
> on the precision of last_residency calculations.
> Daniel can perhaps comment on this.
>
--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Lezcano June 23, 2016, 10:01 a.m. UTC | #6
On 06/23/2016 11:28 AM, Balbir Singh wrote:

[ ... ]

>> cpuidle_enter_state()
>> {
>> 	[...]
>> 	time_start = local_clock();
>> 	[enter idle state]
>> 	time_end = local_clock();
>> 	/*
>>           * local_clock() returns the time in nanosecond, let's shift
>>           * by 10 (divide by 1024) to have microsecond based time.
>>           */
>>          diff = (time_end - time_start) >> 10;
>> 	[...]
>> 	dev->last_residency = (int) diff;
>> }
>>
>> Because of >>10 as opposed to /1000, last_residency is lesser by 2.3%

I am surprised the last_residency is 2.3% exactly less. The difference 
between >>10 and /1000 is 2.34%.

What is the next target residency value ?

Does it solve the issue if you replace >>10 by /1000 ?
Shreyas B. Prabhu June 23, 2016, 1:35 p.m. UTC | #7
On 06/23/2016 03:31 PM, Daniel Lezcano wrote:
> On 06/23/2016 11:28 AM, Balbir Singh wrote:
> 
> [ ... ]
> 
>>> cpuidle_enter_state()
>>> {
>>>     [...]
>>>     time_start = local_clock();
>>>     [enter idle state]
>>>     time_end = local_clock();
>>>     /*
>>>           * local_clock() returns the time in nanosecond, let's shift
>>>           * by 10 (divide by 1024) to have microsecond based time.
>>>           */
>>>          diff = (time_end - time_start) >> 10;
>>>     [...]
>>>     dev->last_residency = (int) diff;
>>> }
>>>
>>> Because of >>10 as opposed to /1000, last_residency is lesser by 2.3%
> 
> I am surprised the last_residency is 2.3% exactly less. The difference
> between >>10 and /1000 is 2.34%.
> 
> What is the next target residency value ?
> 
Target residency of the next idle state is 100 microseconds.
When snooze times out after 100 microseconds, last_residency value
calculated is typically 97 or 98 microseconds.

> Does it solve the issue if you replace >>10 by /1000 ?
> 

Yes it does.

--Shreyas

--
To unsubscribe from this list: send the line "unsubscribe linux-pm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Lezcano June 23, 2016, 2:36 p.m. UTC | #8
On 06/23/2016 03:35 PM, Shreyas B Prabhu wrote:
>
>
> On 06/23/2016 03:31 PM, Daniel Lezcano wrote:
>> On 06/23/2016 11:28 AM, Balbir Singh wrote:
>>
>> [ ... ]
>>
>>>> cpuidle_enter_state()
>>>> {
>>>>      [...]
>>>>      time_start = local_clock();
>>>>      [enter idle state]
>>>>      time_end = local_clock();
>>>>      /*
>>>>            * local_clock() returns the time in nanosecond, let's shift
>>>>            * by 10 (divide by 1024) to have microsecond based time.
>>>>            */
>>>>           diff = (time_end - time_start) >> 10;
>>>>      [...]
>>>>      dev->last_residency = (int) diff;
>>>> }
>>>>
>>>> Because of >>10 as opposed to /1000, last_residency is lesser by 2.3%
>>
>> I am surprised the last_residency is 2.3% exactly less. The difference
>> between >>10 and /1000 is 2.34%.
>>
>> What is the next target residency value ?
>>
> Target residency of the next idle state is 100 microseconds.
> When snooze times out after 100 microseconds, last_residency value
> calculated is typically 97 or 98 microseconds.

I see, the snooze exit is very fast.

>> Does it solve the issue if you replace >>10 by /1000 ?
>>
>
> Yes it does.

Ok. IMO, it would be cleaner to fix this in the core code.

   -- Daniel
diff mbox

Patch

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index e12dc30..5835491 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -268,10 +268,24 @@  static int powernv_idle_probe(void)
 		cpuidle_state_table = powernv_states;
 		/* Device tree can indicate more idle states */
 		max_idle_state = powernv_add_idle_states();
+
+		/*
+		 * Staying in snooze for a long period can degrade the
+		 * perfomance of the sibling cpus. Set timeout for snooze such
+		 * that if the cpu stays in snooze longer than target residency
+		 * of the next available idle state then exit from snooze. This
+		 * gives a chance to the cpuidle governor to re-evaluate and
+		 * promote it to deeper idle states.
+		 */
 		if (max_idle_state > 1) {
 			snooze_timeout_en = true;
 			snooze_timeout = powernv_states[1].target_residency *
 					 tb_ticks_per_usec;
+			/*
+			 * Give a 5% margin since target residency related math
+			 * is not precise in cpuidle core.
+			 */
+			snooze_timeout += snooze_timeout / 20;
 		}
  	} else
  		return -ENODEV;
diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
index 07135e0..22de841 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -250,10 +250,23 @@  static int pseries_idle_probe(void)
 	} else
 		return -ENODEV;
 
+	/*
+	 * Staying in snooze for a long period can degrade the
+	 * perfomance of the sibling cpus. Set timeout for snooze such
+	 * that if the cpu stays in snooze longer than target residency
+	 * of the next available idle state then exit from snooze. This
+	 * gives a chance to the cpuidle governor to re-evaluate and
+	 * promote it to deeper idle states.
+	 */
 	if (max_idle_state > 1) {
 		snooze_timeout_en = true;
 		snooze_timeout = cpuidle_state_table[1].target_residency *
 				 tb_ticks_per_usec;
+		/*
+		 * Give a 5% margin since target residency related math
+		 * is not precise in cpuidle core.
+		 */
+		snooze_timeout += snooze_timeout / 20;
 	}
 	return 0;
 }