Message ID | 1466624203-1847-1-git-send-email-shreyas@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | Superseded, archived |
Headers | show |
On Wed, Jun 22, 2016 at 9:36 PM, Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> wrote: > Snooze is a poll idle state in powernv and pseries platforms. Snooze > has a timeout so that if a cpu stays in snooze for more than target > residency of the next available idle state, then it would exit thereby > giving chance to the cpuidle governor to re-evaluate and > promote the cpu to a deeper idle state. Therefore whenever snooze exits > due to this timeout, its last_residency will be target_residency of next > deeper state. > > commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()") > changed the math around last_residency calculation. Specifically, while > converting last_residency value from nanoseconds to microseconds it does > right shift by 10. Due to this, in snooze timeout exit scenarios > last_residency calculated is roughly 2.3% less than target_residency of > next available state. This pattern is picked up get_typical_interval() > in the menu governor and therefore expected_interval in menu_select() is > frequently less than the target_residency of any state but snooze. > > Due to this we are entering snooze at a higher rate, thereby affecting > the single thread performance. > Since the math around last_residency is not meant to be precise, fix this > issue setting snooze timeout to 105% of target_residency of next > available idle state. > > This also adds comment around why snooze timeout is necessary. Daniel, any comments? > Reported-by: Anton Blanchard <anton@samba.org> > Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> > --- > drivers/cpuidle/cpuidle-powernv.c | 14 ++++++++++++++ > drivers/cpuidle/cpuidle-pseries.c | 13 +++++++++++++ > 2 files changed, 27 insertions(+) > > diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c > index e12dc30..5835491 100644 > --- a/drivers/cpuidle/cpuidle-powernv.c > +++ b/drivers/cpuidle/cpuidle-powernv.c > @@ -268,10 +268,24 @@ static int powernv_idle_probe(void) > cpuidle_state_table = powernv_states; > /* Device tree can indicate more idle states */ > max_idle_state = powernv_add_idle_states(); > + > + /* > + * Staying in snooze for a long period can degrade the > + * perfomance of the sibling cpus. Set timeout for snooze such > + * that if the cpu stays in snooze longer than target residency > + * of the next available idle state then exit from snooze. This > + * gives a chance to the cpuidle governor to re-evaluate and > + * promote it to deeper idle states. > + */ > if (max_idle_state > 1) { > snooze_timeout_en = true; > snooze_timeout = powernv_states[1].target_residency * > tb_ticks_per_usec; > + /* > + * Give a 5% margin since target residency related math > + * is not precise in cpuidle core. > + */ > + snooze_timeout += snooze_timeout / 20; > } > } else > return -ENODEV; > diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c > index 07135e0..22de841 100644 > --- a/drivers/cpuidle/cpuidle-pseries.c > +++ b/drivers/cpuidle/cpuidle-pseries.c > @@ -250,10 +250,23 @@ static int pseries_idle_probe(void) > } else > return -ENODEV; > > + /* > + * Staying in snooze for a long period can degrade the > + * perfomance of the sibling cpus. Set timeout for snooze such > + * that if the cpu stays in snooze longer than target residency > + * of the next available idle state then exit from snooze. This > + * gives a chance to the cpuidle governor to re-evaluate and > + * promote it to deeper idle states. > + */ > if (max_idle_state > 1) { > snooze_timeout_en = true; > snooze_timeout = cpuidle_state_table[1].target_residency * > tb_ticks_per_usec; > + /* > + * Give a 5% margin since target residency related math > + * is not precise in cpuidle core. > + */ > + snooze_timeout += snooze_timeout / 20; > } > return 0; > } > -- > 2.1.4 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-pm" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 06/23/2016 05:18 AM, Balbir Singh wrote: > > > On 23/06/16 05:36, Shreyas B. Prabhu wrote: >> Snooze is a poll idle state in powernv and pseries platforms. Snooze >> has a timeout so that if a cpu stays in snooze for more than target >> residency of the next available idle state, then it would exit thereby >> giving chance to the cpuidle governor to re-evaluate and >> promote the cpu to a deeper idle state. Therefore whenever snooze exits >> due to this timeout, its last_residency will be target_residency of next >> deeper state. >> >> commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()") >> changed the math around last_residency calculation. Specifically, while >> converting last_residency value from nanoseconds to microseconds it does >> right shift by 10. Due to this, in snooze timeout exit scenarios >> last_residency calculated is roughly 2.3% less than target_residency of >> next available state. This pattern is picked up get_typical_interval() >> in the menu governor and therefore expected_interval in menu_select() is >> frequently less than the target_residency of any state but snooze. >> >> Due to this we are entering snooze at a higher rate, thereby affecting >> the single thread performance. >> Since the math around last_residency is not meant to be precise, fix this >> issue setting snooze timeout to 105% of target_residency of next >> available idle state. >> >> This also adds comment around why snooze timeout is necessary. >> >> Reported-by: Anton Blanchard <anton@samba.org> >> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> >> --- >> drivers/cpuidle/cpuidle-powernv.c | 14 ++++++++++++++ >> drivers/cpuidle/cpuidle-pseries.c | 13 +++++++++++++ >> 2 files changed, 27 insertions(+) >> >> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c >> index e12dc30..5835491 100644 >> --- a/drivers/cpuidle/cpuidle-powernv.c >> +++ b/drivers/cpuidle/cpuidle-powernv.c >> @@ -268,10 +268,24 @@ static int powernv_idle_probe(void) >> cpuidle_state_table = powernv_states; >> /* Device tree can indicate more idle states */ >> max_idle_state = powernv_add_idle_states(); >> + >> + /* >> + * Staying in snooze for a long period can degrade the >> + * perfomance of the sibling cpus. Set timeout for snooze such >> + * that if the cpu stays in snooze longer than target residency >> + * of the next available idle state then exit from snooze. This >> + * gives a chance to the cpuidle governor to re-evaluate and >> + * promote it to deeper idle states. >> + */ >> if (max_idle_state > 1) { >> snooze_timeout_en = true; >> snooze_timeout = powernv_states[1].target_residency * >> tb_ticks_per_usec; >> + /* >> + * Give a 5% margin since target residency related math >> + * is not precise in cpuidle core. >> + */ > > Is this due to the microsecond conversion mentioned above? It would be nice to > have it in the comment. Does > > (powernv_states[1].target_residency + tb_ticks_per_usec) / tb_ticks_per_usec solve > your rounding issues, assuming the issue is really rounding or maybe it is due > to the shift by 10, could you please elaborate on what related math is not > precise? That would explain to me why I missed understanding your changes. > >> + snooze_timeout += snooze_timeout / 20; > > For now 5% is sufficient, but do you want to check to assert to check if > > snooze_timeout (in microseconds) / tb_ticks_per_usec > powernv_states[i].target_residency? > This is not a rounding issue. As I mentioned in the commit message, this is because of the last_residency calculation in cpuidle.c. To elaborate, last residency calculation is done in the following way after commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()") - cpuidle_enter_state() { [...] time_start = local_clock(); [enter idle state] time_end = local_clock(); /* * local_clock() returns the time in nanosecond, let's shift * by 10 (divide by 1024) to have microsecond based time. */ diff = (time_end - time_start) >> 10; [...] dev->last_residency = (int) diff; } Because of >>10 as opposed to /1000, last_residency is lesser by 2.3% In snooze timeout exit scenarios because of this, last_residency calculated is 2.3% less than target_residency of next available state. This affects get_typical_interval() in the menu governor and therefore expected_interval in menu_select() is frequently less than the target_residency of any state but snooze. I'll expand the comments as you suggested. Thanks, Shreyas -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 23/06/16 14:58, Shreyas B Prabhu wrote: > > > On 06/23/2016 05:18 AM, Balbir Singh wrote: >> >> >> On 23/06/16 05:36, Shreyas B. Prabhu wrote: >>> Snooze is a poll idle state in powernv and pseries platforms. Snooze >>> has a timeout so that if a cpu stays in snooze for more than target >>> residency of the next available idle state, then it would exit thereby >>> giving chance to the cpuidle governor to re-evaluate and >>> promote the cpu to a deeper idle state. Therefore whenever snooze exits >>> due to this timeout, its last_residency will be target_residency of next >>> deeper state. >>> >>> commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()") >>> changed the math around last_residency calculation. Specifically, while >>> converting last_residency value from nanoseconds to microseconds it does >>> right shift by 10. Due to this, in snooze timeout exit scenarios >>> last_residency calculated is roughly 2.3% less than target_residency of >>> next available state. This pattern is picked up get_typical_interval() >>> in the menu governor and therefore expected_interval in menu_select() is >>> frequently less than the target_residency of any state but snooze. >>> >>> Due to this we are entering snooze at a higher rate, thereby affecting >>> the single thread performance. >>> Since the math around last_residency is not meant to be precise, fix this >>> issue setting snooze timeout to 105% of target_residency of next >>> available idle state. >>> >>> This also adds comment around why snooze timeout is necessary. >>> >>> Reported-by: Anton Blanchard <anton@samba.org> >>> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> >>> --- >>> drivers/cpuidle/cpuidle-powernv.c | 14 ++++++++++++++ >>> drivers/cpuidle/cpuidle-pseries.c | 13 +++++++++++++ >>> 2 files changed, 27 insertions(+) >>> >>> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c >>> index e12dc30..5835491 100644 >>> --- a/drivers/cpuidle/cpuidle-powernv.c >>> +++ b/drivers/cpuidle/cpuidle-powernv.c >>> @@ -268,10 +268,24 @@ static int powernv_idle_probe(void) >>> cpuidle_state_table = powernv_states; >>> /* Device tree can indicate more idle states */ >>> max_idle_state = powernv_add_idle_states(); >>> + >>> + /* >>> + * Staying in snooze for a long period can degrade the >>> + * perfomance of the sibling cpus. Set timeout for snooze such >>> + * that if the cpu stays in snooze longer than target residency >>> + * of the next available idle state then exit from snooze. This >>> + * gives a chance to the cpuidle governor to re-evaluate and >>> + * promote it to deeper idle states. >>> + */ >>> if (max_idle_state > 1) { >>> snooze_timeout_en = true; >>> snooze_timeout = powernv_states[1].target_residency * >>> tb_ticks_per_usec; >>> + /* >>> + * Give a 5% margin since target residency related math >>> + * is not precise in cpuidle core. >>> + */ >> >> Is this due to the microsecond conversion mentioned above? It would be nice to >> have it in the comment. Does >> >> (powernv_states[1].target_residency + tb_ticks_per_usec) / tb_ticks_per_usec solve >> your rounding issues, assuming the issue is really rounding or maybe it is due >> to the shift by 10, could you please elaborate on what related math is not >> precise? That would explain to me why I missed understanding your changes. >> >>> + snooze_timeout += snooze_timeout / 20; >> >> For now 5% is sufficient, but do you want to check to assert to check if >> >> snooze_timeout (in microseconds) / tb_ticks_per_usec > powernv_states[i].target_residency? >> > > This is not a rounding issue. As I mentioned in the commit message, this > is because of the last_residency calculation in cpuidle.c. > To elaborate, last residency calculation is done in the following way > after commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with > local_clock()") - > > cpuidle_enter_state() > { > [...] > time_start = local_clock(); > [enter idle state] > time_end = local_clock(); > /* > * local_clock() returns the time in nanosecond, let's shift > * by 10 (divide by 1024) to have microsecond based time. > */ > diff = (time_end - time_start) >> 10; > [...] > dev->last_residency = (int) diff; > } > > Because of >>10 as opposed to /1000, last_residency is lesser by 2.3% This is still a rounding error but at a different site. I see we saved a division by doing a >> 10, but we added it right back by doing a /20 later in the platform code. Shouldn't the rounding affect other platforms as well? Can't we fix it in cpuidle_enter_state(). Division by 1000 can be optimized if required (but rather not add that complexity). Thanks for patiently explaining this Balbir -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 06/23/2016 02:58 PM, Balbir Singh wrote: > > > On 23/06/16 14:58, Shreyas B Prabhu wrote: >> >> >> On 06/23/2016 05:18 AM, Balbir Singh wrote: >>> >>> >>> On 23/06/16 05:36, Shreyas B. Prabhu wrote: >>>> Snooze is a poll idle state in powernv and pseries platforms. Snooze >>>> has a timeout so that if a cpu stays in snooze for more than target >>>> residency of the next available idle state, then it would exit thereby >>>> giving chance to the cpuidle governor to re-evaluate and >>>> promote the cpu to a deeper idle state. Therefore whenever snooze exits >>>> due to this timeout, its last_residency will be target_residency of next >>>> deeper state. >>>> >>>> commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()") >>>> changed the math around last_residency calculation. Specifically, while >>>> converting last_residency value from nanoseconds to microseconds it does >>>> right shift by 10. Due to this, in snooze timeout exit scenarios >>>> last_residency calculated is roughly 2.3% less than target_residency of >>>> next available state. This pattern is picked up get_typical_interval() >>>> in the menu governor and therefore expected_interval in menu_select() is >>>> frequently less than the target_residency of any state but snooze. >>>> >>>> Due to this we are entering snooze at a higher rate, thereby affecting >>>> the single thread performance. >>>> Since the math around last_residency is not meant to be precise, fix this >>>> issue setting snooze timeout to 105% of target_residency of next >>>> available idle state. >>>> >>>> This also adds comment around why snooze timeout is necessary. >>>> >>>> Reported-by: Anton Blanchard <anton@samba.org> >>>> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> >>>> --- >>>> drivers/cpuidle/cpuidle-powernv.c | 14 ++++++++++++++ >>>> drivers/cpuidle/cpuidle-pseries.c | 13 +++++++++++++ >>>> 2 files changed, 27 insertions(+) >>>> >>>> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c >>>> index e12dc30..5835491 100644 >>>> --- a/drivers/cpuidle/cpuidle-powernv.c >>>> +++ b/drivers/cpuidle/cpuidle-powernv.c >>>> @@ -268,10 +268,24 @@ static int powernv_idle_probe(void) >>>> cpuidle_state_table = powernv_states; >>>> /* Device tree can indicate more idle states */ >>>> max_idle_state = powernv_add_idle_states(); >>>> + >>>> + /* >>>> + * Staying in snooze for a long period can degrade the >>>> + * perfomance of the sibling cpus. Set timeout for snooze such >>>> + * that if the cpu stays in snooze longer than target residency >>>> + * of the next available idle state then exit from snooze. This >>>> + * gives a chance to the cpuidle governor to re-evaluate and >>>> + * promote it to deeper idle states. >>>> + */ >>>> if (max_idle_state > 1) { >>>> snooze_timeout_en = true; >>>> snooze_timeout = powernv_states[1].target_residency * >>>> tb_ticks_per_usec; >>>> + /* >>>> + * Give a 5% margin since target residency related math >>>> + * is not precise in cpuidle core. >>>> + */ >>> >>> Is this due to the microsecond conversion mentioned above? It would be nice to >>> have it in the comment. Does >>> >>> (powernv_states[1].target_residency + tb_ticks_per_usec) / tb_ticks_per_usec solve >>> your rounding issues, assuming the issue is really rounding or maybe it is due >>> to the shift by 10, could you please elaborate on what related math is not >>> precise? That would explain to me why I missed understanding your changes. >>> >>>> + snooze_timeout += snooze_timeout / 20; >>> >>> For now 5% is sufficient, but do you want to check to assert to check if >>> >>> snooze_timeout (in microseconds) / tb_ticks_per_usec > powernv_states[i].target_residency? >>> >> >> This is not a rounding issue. As I mentioned in the commit message, this >> is because of the last_residency calculation in cpuidle.c. >> To elaborate, last residency calculation is done in the following way >> after commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with >> local_clock()") - >> >> cpuidle_enter_state() >> { >> [...] >> time_start = local_clock(); >> [enter idle state] >> time_end = local_clock(); >> /* >> * local_clock() returns the time in nanosecond, let's shift >> * by 10 (divide by 1024) to have microsecond based time. >> */ >> diff = (time_end - time_start) >> 10; >> [...] >> dev->last_residency = (int) diff; >> } >> >> Because of >>10 as opposed to /1000, last_residency is lesser by 2.3% > > > This is still a rounding error but at a different site. I see we saved > a division by doing a >> 10, but we added it right back by doing a /20 > later in the platform code. While a >> 10 is done at every idle exit, div by 20 is done once during boot, so this doesn't negate the previous optimization. > Shouldn't the rounding affect other > platforms as well? Can't we fix it in cpuidle_enter_state(). This does affect all platforms, but I'm guessing no other place relied on the precision of last_residency calculations. Daniel can perhaps comment on this. > Division > by 1000 can be optimized if required (but rather not add that complexity). > Thanks for patiently explaining this > > Balbir > -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
>> This is still a rounding error but at a different site. I see we saved >> a division by doing a >> 10, but we added it right back by doing a /20 >> later in the platform code. > > While a >> 10 is done at every idle exit, div by 20 is done once during > boot, so this doesn't negate the previous optimization. > Yes, fair point >> Shouldn't the rounding affect other >> platforms as well? Can't we fix it in cpuidle_enter_state(). > > This does affect all platforms, but I'm guessing no other place relied > on the precision of last_residency calculations. > Daniel can perhaps comment on this. > -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 06/23/2016 11:28 AM, Balbir Singh wrote: [ ... ] >> cpuidle_enter_state() >> { >> [...] >> time_start = local_clock(); >> [enter idle state] >> time_end = local_clock(); >> /* >> * local_clock() returns the time in nanosecond, let's shift >> * by 10 (divide by 1024) to have microsecond based time. >> */ >> diff = (time_end - time_start) >> 10; >> [...] >> dev->last_residency = (int) diff; >> } >> >> Because of >>10 as opposed to /1000, last_residency is lesser by 2.3% I am surprised the last_residency is 2.3% exactly less. The difference between >>10 and /1000 is 2.34%. What is the next target residency value ? Does it solve the issue if you replace >>10 by /1000 ?
On 06/23/2016 03:31 PM, Daniel Lezcano wrote: > On 06/23/2016 11:28 AM, Balbir Singh wrote: > > [ ... ] > >>> cpuidle_enter_state() >>> { >>> [...] >>> time_start = local_clock(); >>> [enter idle state] >>> time_end = local_clock(); >>> /* >>> * local_clock() returns the time in nanosecond, let's shift >>> * by 10 (divide by 1024) to have microsecond based time. >>> */ >>> diff = (time_end - time_start) >> 10; >>> [...] >>> dev->last_residency = (int) diff; >>> } >>> >>> Because of >>10 as opposed to /1000, last_residency is lesser by 2.3% > > I am surprised the last_residency is 2.3% exactly less. The difference > between >>10 and /1000 is 2.34%. > > What is the next target residency value ? > Target residency of the next idle state is 100 microseconds. When snooze times out after 100 microseconds, last_residency value calculated is typically 97 or 98 microseconds. > Does it solve the issue if you replace >>10 by /1000 ? > Yes it does. --Shreyas -- To unsubscribe from this list: send the line "unsubscribe linux-pm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 06/23/2016 03:35 PM, Shreyas B Prabhu wrote: > > > On 06/23/2016 03:31 PM, Daniel Lezcano wrote: >> On 06/23/2016 11:28 AM, Balbir Singh wrote: >> >> [ ... ] >> >>>> cpuidle_enter_state() >>>> { >>>> [...] >>>> time_start = local_clock(); >>>> [enter idle state] >>>> time_end = local_clock(); >>>> /* >>>> * local_clock() returns the time in nanosecond, let's shift >>>> * by 10 (divide by 1024) to have microsecond based time. >>>> */ >>>> diff = (time_end - time_start) >> 10; >>>> [...] >>>> dev->last_residency = (int) diff; >>>> } >>>> >>>> Because of >>10 as opposed to /1000, last_residency is lesser by 2.3% >> >> I am surprised the last_residency is 2.3% exactly less. The difference >> between >>10 and /1000 is 2.34%. >> >> What is the next target residency value ? >> > Target residency of the next idle state is 100 microseconds. > When snooze times out after 100 microseconds, last_residency value > calculated is typically 97 or 98 microseconds. I see, the snooze exit is very fast. >> Does it solve the issue if you replace >>10 by /1000 ? >> > > Yes it does. Ok. IMO, it would be cleaner to fix this in the core code. -- Daniel
diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c index e12dc30..5835491 100644 --- a/drivers/cpuidle/cpuidle-powernv.c +++ b/drivers/cpuidle/cpuidle-powernv.c @@ -268,10 +268,24 @@ static int powernv_idle_probe(void) cpuidle_state_table = powernv_states; /* Device tree can indicate more idle states */ max_idle_state = powernv_add_idle_states(); + + /* + * Staying in snooze for a long period can degrade the + * perfomance of the sibling cpus. Set timeout for snooze such + * that if the cpu stays in snooze longer than target residency + * of the next available idle state then exit from snooze. This + * gives a chance to the cpuidle governor to re-evaluate and + * promote it to deeper idle states. + */ if (max_idle_state > 1) { snooze_timeout_en = true; snooze_timeout = powernv_states[1].target_residency * tb_ticks_per_usec; + /* + * Give a 5% margin since target residency related math + * is not precise in cpuidle core. + */ + snooze_timeout += snooze_timeout / 20; } } else return -ENODEV; diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c index 07135e0..22de841 100644 --- a/drivers/cpuidle/cpuidle-pseries.c +++ b/drivers/cpuidle/cpuidle-pseries.c @@ -250,10 +250,23 @@ static int pseries_idle_probe(void) } else return -ENODEV; + /* + * Staying in snooze for a long period can degrade the + * perfomance of the sibling cpus. Set timeout for snooze such + * that if the cpu stays in snooze longer than target residency + * of the next available idle state then exit from snooze. This + * gives a chance to the cpuidle governor to re-evaluate and + * promote it to deeper idle states. + */ if (max_idle_state > 1) { snooze_timeout_en = true; snooze_timeout = cpuidle_state_table[1].target_residency * tb_ticks_per_usec; + /* + * Give a 5% margin since target residency related math + * is not precise in cpuidle core. + */ + snooze_timeout += snooze_timeout / 20; } return 0; }
Snooze is a poll idle state in powernv and pseries platforms. Snooze has a timeout so that if a cpu stays in snooze for more than target residency of the next available idle state, then it would exit thereby giving chance to the cpuidle governor to re-evaluate and promote the cpu to a deeper idle state. Therefore whenever snooze exits due to this timeout, its last_residency will be target_residency of next deeper state. commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()") changed the math around last_residency calculation. Specifically, while converting last_residency value from nanoseconds to microseconds it does right shift by 10. Due to this, in snooze timeout exit scenarios last_residency calculated is roughly 2.3% less than target_residency of next available state. This pattern is picked up get_typical_interval() in the menu governor and therefore expected_interval in menu_select() is frequently less than the target_residency of any state but snooze. Due to this we are entering snooze at a higher rate, thereby affecting the single thread performance. Since the math around last_residency is not meant to be precise, fix this issue setting snooze timeout to 105% of target_residency of next available idle state. This also adds comment around why snooze timeout is necessary. Reported-by: Anton Blanchard <anton@samba.org> Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com> --- drivers/cpuidle/cpuidle-powernv.c | 14 ++++++++++++++ drivers/cpuidle/cpuidle-pseries.c | 13 +++++++++++++ 2 files changed, 27 insertions(+)