diff mbox

[RFC] ARM: sched_clock: update epoch_cyc on resume

Message ID 1342567672-29071-1-git-send-email-ccross@android.com (mailing list archive)
State New, archived
Headers show

Commit Message

Colin Cross July 17, 2012, 11:27 p.m. UTC
Many clocks that are used to provide sched_clock will reset during
suspend.  If read_sched_clock returns 0 after suspend, sched_clock will
appear to jump forward.  This patch resets cd.epoch_cyc to the current
value of read_sched_clock during resume, which causes sched_clock() just
after suspend to return the same value as sched_clock() just before
suspend.

In addition, during the window where epoch_ns has been updated before
suspend, but epoch_cyc has not been updated after suspend, it is unknown
whether the clock has reset or not, and sched_clock() could return a
bogus value.  Add a suspended flag, and return the pre-suspend epoch_ns
value during this period.

This will have a side effect of causing SoCs that have clocks that
continue to count in suspend to appear to stop counting, reporting the
same sched_clock() value before and after suspend.

Signed-off-by: Colin Cross <ccross@android.com>
---
 arch/arm/kernel/sched_clock.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

Comments

Linus Walleij July 23, 2012, 6:55 p.m. UTC | #1
On Wed, Jul 18, 2012 at 1:27 AM, Colin Cross <ccross@android.com> wrote:

> This will have a side effect of causing SoCs that have clocks that
> continue to count in suspend to appear to stop counting, reporting the
> same sched_clock() value before and after suspend.

So for our platform (ux500) that has a sched clock that *does*
continue to run during suspend,
drivers/clocksource/clksrc-dbx500-prcmu.c
how do we opt out of this behaviour?

Since sched_clock is used for the debug prints, if we have a
crash immediately after resume() it will appear to be at resume
time in the log which kinda sucks. :-(

Isn't the proper way to do this either:

- Assign suspend/resume hooks to the sched_clock code in the
  platform and let the code that reads the hardware clock deal with
  this

Or

- If it absolutely needs to be in the core code, also have a bool
  field indicating whether the clock is going to die during suspend
  and add new registration functions for setting that sched_clock
  type, e.g. setup_sched_clock_nonsuspendable()

Yours,
Linus Walleij
Colin Cross July 23, 2012, 7:27 p.m. UTC | #2
On Mon, Jul 23, 2012 at 11:55 AM, Linus Walleij
<linus.walleij@linaro.org> wrote:
> On Wed, Jul 18, 2012 at 1:27 AM, Colin Cross <ccross@android.com> wrote:
>
>> This will have a side effect of causing SoCs that have clocks that
>> continue to count in suspend to appear to stop counting, reporting the
>> same sched_clock() value before and after suspend.
>
> So for our platform (ux500) that has a sched clock that *does*
> continue to run during suspend,
> drivers/clocksource/clksrc-dbx500-prcmu.c
> how do we opt out of this behaviour?

Does the clock you use for sched_clock continue to run in all suspend
modes?  All the SoC's I've used only have a 32kHz clock in the deepest
suspend mode, which is not ideal for sched_clock.  For example, on
Tegra2 the faster 1MHz clock used for sched_clock resets in the
deepest suspend state (LP0) but not the shallowest suspend state
(LP2), and which suspend state the chip hits depends on which hardware
is active.  Opting out of this patch would cause Tegra's clock to
sometimes run in suspend, and sometimes not, which seems worse for
debugging than consistently not running in suspend.  I'd be surprised
if a similar situation didn't apply to your platform.

> Since sched_clock is used for the debug prints, if we have a
> crash immediately after resume() it will appear to be at resume
> time in the log which kinda sucks. :-(

Most resume implementations I've seen print a resume message very
early, making it fairly easy to distinguish between suspend and resume
crashes, but I can see why an accurate timestamp would be helpful.

> Isn't the proper way to do this either:
>
> - Assign suspend/resume hooks to the sched_clock code in the
>   platform and let the code that reads the hardware clock deal with
>   this

That's effectively what I've done on 3 different platforms, which is
why I thought it might be better to put it in the core code.

> Or
>
> - If it absolutely needs to be in the core code, also have a bool
>   field indicating whether the clock is going to die during suspend
>   and add new registration functions for setting that sched_clock
>   type, e.g. setup_sched_clock_nonsuspendable()

Sounds reasonable if some platforms need the extra complexity.
Linus Walleij July 24, 2012, 12:14 a.m. UTC | #3
On Mon, Jul 23, 2012 at 9:27 PM, Colin Cross <ccross@android.com> wrote:
> On Mon, Jul 23, 2012 at 11:55 AM, Linus Walleij

> Does the clock you use for sched_clock continue to run in all suspend
> modes? All the SoC's I've used only have a 32kHz clock in the deepest
> suspend mode,

Yes, and yes it is 32kHz.

> which is not ideal for sched_clock.

Not that I know why scheduling with 32kHz is so bad compared to the
default system scheduling granularity which is HZ if you don't have
sched_clock() implemented.

Since this seems to be such an important point, what makes you
want MHz:es for scheduling granularity? To me the biggest impact
is actually the granularity of the timestamps in the printk:s.

(It's not that I doubt your needs, more curiosity.)

>  For example, on
> Tegra2 the faster 1MHz clock used for sched_clock resets in the
> deepest suspend state (LP0) but not the shallowest suspend state
> (LP2), and which suspend state the chip hits depends on which hardware
> is active.  Opting out of this patch would cause Tegra's clock to
> sometimes run in suspend, and sometimes not, which seems worse for
> debugging than consistently not running in suspend.  I'd be surprised
> if a similar situation didn't apply to your platform.

Well being able to switch between different sched_clock() providers
may be the ideal...

>> - If it absolutely needs to be in the core code, also have a bool
>>   field indicating whether the clock is going to die during suspend
>>   and add new registration functions for setting that sched_clock
>>   type, e.g. setup_sched_clock_nonsuspendable()
>
> Sounds reasonable if some platforms need the extra complexity.

OK agreed.

A connecting theme is that of being avle to flag clock sources as
sched_clock providers. If all clocksources were tagged with
rating, and only clocksources were used for sched_clock(), the
kernel could select the highest-rated clock under all circumstances.

But that's quite intrusive, more of an idea. :-P

Yours,
Linus Walleij
Colin Cross July 24, 2012, 12:28 a.m. UTC | #4
On Mon, Jul 23, 2012 at 5:14 PM, Linus Walleij <linus.walleij@linaro.org> wrote:
> On Mon, Jul 23, 2012 at 9:27 PM, Colin Cross <ccross@android.com> wrote:
>> On Mon, Jul 23, 2012 at 11:55 AM, Linus Walleij
>
>> Does the clock you use for sched_clock continue to run in all suspend
>> modes? All the SoC's I've used only have a 32kHz clock in the deepest
>> suspend mode,
>
> Yes, and yes it is 32kHz.
>
>> which is not ideal for sched_clock.
>
> Not that I know why scheduling with 32kHz is so bad compared to the
> default system scheduling granularity which is HZ if you don't have
> sched_clock() implemented.
>
> Since this seems to be such an important point, what makes you
> want MHz:es for scheduling granularity? To me the biggest impact
> is actually the granularity of the timestamps in the printk:s.
>
> (It's not that I doubt your needs, more curiosity.)

There's a comment somewhere about higher resolution sched_clock
providing fairer scheduling.  With 32 kHz sched_clock, every runtime
measured by the scheduler will be wrong by up to 31.25 us.  Most
systems have a faster clock, and if it's available it seems silly not
to use it.

It's also used for tracing, where 31.25 us resolution is a little low
for function tracing or function graph tracing.

>>  For example, on
>> Tegra2 the faster 1MHz clock used for sched_clock resets in the
>> deepest suspend state (LP0) but not the shallowest suspend state
>> (LP2), and which suspend state the chip hits depends on which hardware
>> is active.  Opting out of this patch would cause Tegra's clock to
>> sometimes run in suspend, and sometimes not, which seems worse for
>> debugging than consistently not running in suspend.  I'd be surprised
>> if a similar situation didn't apply to your platform.
>
> Well being able to switch between different sched_clock() providers
> may be the ideal...
>
>>> - If it absolutely needs to be in the core code, also have a bool
>>>   field indicating whether the clock is going to die during suspend
>>>   and add new registration functions for setting that sched_clock
>>>   type, e.g. setup_sched_clock_nonsuspendable()
>>
>> Sounds reasonable if some platforms need the extra complexity.
>
> OK agreed.
>
> A connecting theme is that of being avle to flag clock sources as
> sched_clock providers. If all clocksources were tagged with
> rating, and only clocksources were used for sched_clock(), the
> kernel could select the highest-rated clock under all circumstances.
>
> But that's quite intrusive, more of an idea. :-P

sched_clock is supposed to be very low overhead compared to ktime_get,
and has some strict  requirements if CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
is not set (see kernel/sched/clock.c), but it might be possible.
Barry Song July 24, 2012, 6:43 a.m. UTC | #5
2012/7/18 Colin Cross <ccross@android.com>:
> Many clocks that are used to provide sched_clock will reset during
> suspend.  If read_sched_clock returns 0 after suspend, sched_clock will
> appear to jump forward.  This patch resets cd.epoch_cyc to the current
> value of read_sched_clock during resume, which causes sched_clock() just
> after suspend to return the same value as sched_clock() just before
> suspend.
>
> In addition, during the window where epoch_ns has been updated before
> suspend, but epoch_cyc has not been updated after suspend, it is unknown
> whether the clock has reset or not, and sched_clock() could return a
> bogus value.  Add a suspended flag, and return the pre-suspend epoch_ns
> value during this period.

Acked-by: Barry Song <21cnbao@gmail.com>

this patch should also fix the issue that:
1. launch some rt threads, rt threads sleep before suspend
2. repeat to suspend/resume
3. after resuming, waking up rt threads

repeat 1-3 again and again, sometimes all rt threads will hang after
resuming due to wrong sched_clock will make sched_rt think rt_time is
much more than rt_runtime (default 950ms in 1s). then rt threads will
lost cpu timeslot to run since the 95% threshold is there.

>
> This will have a side effect of causing SoCs that have clocks that
> continue to count in suspend to appear to stop counting, reporting the
> same sched_clock() value before and after suspend.
>
> Signed-off-by: Colin Cross <ccross@android.com>
> ---
>  arch/arm/kernel/sched_clock.c |   13 +++++++++++++
>  1 files changed, 13 insertions(+), 0 deletions(-)
>
> diff --git a/arch/arm/kernel/sched_clock.c b/arch/arm/kernel/sched_clock.c
> index 27d186a..46c7d32 100644
> --- a/arch/arm/kernel/sched_clock.c
> +++ b/arch/arm/kernel/sched_clock.c
> @@ -21,6 +21,7 @@ struct clock_data {
>         u32 epoch_cyc_copy;
>         u32 mult;
>         u32 shift;
> +       bool suspended;
>  };
>
>  static void sched_clock_poll(unsigned long wrap_ticks);
> @@ -49,6 +50,9 @@ static unsigned long long cyc_to_sched_clock(u32 cyc, u32 mask)
>         u64 epoch_ns;
>         u32 epoch_cyc;
>
> +       if (cd.suspended)
> +               return cd.epoch_ns;
> +
>         /*
>          * Load the epoch_cyc and epoch_ns atomically.  We do this by
>          * ensuring that we always write epoch_cyc, epoch_ns and
> @@ -169,11 +173,20 @@ void __init sched_clock_postinit(void)
>  static int sched_clock_suspend(void)
>  {
>         sched_clock_poll(sched_clock_timer.data);
> +       cd.suspended = true;
>         return 0;
>  }
>
> +static void sched_clock_resume(void)
> +{
> +       cd.epoch_cyc = read_sched_clock();
> +       cd.epoch_cyc_copy = cd.epoch_cyc;
> +       cd.suspended = false;
> +}
> +
>  static struct syscore_ops sched_clock_ops = {
>         .suspend = sched_clock_suspend,
> +       .resume = sched_clock_resume,
>  };
>
>  static int __init sched_clock_syscore_init(void)


-barry
Vaibhav Bedia July 24, 2012, 9:16 a.m. UTC | #6
On Tue, Jul 24, 2012 at 05:58:32, Colin Cross wrote:
> On Mon, Jul 23, 2012 at 5:14 PM, Linus Walleij <linus.walleij@linaro.org> wrote:
> > On Mon, Jul 23, 2012 at 9:27 PM, Colin Cross <ccross@android.com> wrote:
> >> On Mon, Jul 23, 2012 at 11:55 AM, Linus Walleij
> >
> >> Does the clock you use for sched_clock continue to run in all suspend
> >> modes? All the SoC's I've used only have a 32kHz clock in the deepest
> >> suspend mode,
> >
> > Yes, and yes it is 32kHz.
> >

On the platform that I am working on (AM335x) the sched_clock continues to
run and yes even on this it runs @ 32KHz.

> >> which is not ideal for sched_clock.
> >
> > Not that I know why scheduling with 32kHz is so bad compared to the
> > default system scheduling granularity which is HZ if you don't have
> > sched_clock() implemented.
> >
> > Since this seems to be such an important point, what makes you
> > want MHz:es for scheduling granularity? To me the biggest impact
> > is actually the granularity of the timestamps in the printk:s.
> >

I personally find the timestamps in the printks very late in the suspend
or very early in the resume extremely helpful for debugging. 
The printk timestamps also help figure out which portions of the platform
suspend code are taking long and can be optimized.

> > (It's not that I doubt your needs, more curiosity.)
> 
> There's a comment somewhere about higher resolution sched_clock
> providing fairer scheduling.  With 32 kHz sched_clock, every runtime
> measured by the scheduler will be wrong by up to 31.25 us.  Most
> systems have a faster clock, and if it's available it seems silly not
> to use it.
> 
> It's also used for tracing, where 31.25 us resolution is a little low
> for function tracing or function graph tracing.
> 
> >>  For example, on
> >> Tegra2 the faster 1MHz clock used for sched_clock resets in the
> >> deepest suspend state (LP0) but not the shallowest suspend state
> >> (LP2), and which suspend state the chip hits depends on which hardware
> >> is active.  Opting out of this patch would cause Tegra's clock to
> >> sometimes run in suspend, and sometimes not, which seems worse for
> >> debugging than consistently not running in suspend.  I'd be surprised
> >> if a similar situation didn't apply to your platform.
> >

Is there any other slower on Tegra2 which doesn't reset and hence can be used
for sched_clock in LP0 as well as LP2? 

There's a trade-off that the end-user might choose to make - 
either use a faster clock for sched_clock which stops in suspend
or use a slower, in most cases 32KHz, clock which doesn't.
IMO this aspect should be configurable.

> > Well being able to switch between different sched_clock() providers
> > may be the ideal...
> >

Hmm... not exactly. More below.

> >>> - If it absolutely needs to be in the core code, also have a bool
> >>>   field indicating whether the clock is going to die during suspend
> >>>   and add new registration functions for setting that sched_clock
> >>>   type, e.g. setup_sched_clock_nonsuspendable()
> >>
> >> Sounds reasonable if some platforms need the extra complexity.
> >

I am not sure but I am guessing even on OMAPx the timer used as sched_clock
keeps running.

> > OK agreed.
> >
> > A connecting theme is that of being avle to flag clock sources as
> > sched_clock providers. If all clocksources were tagged with
> > rating, and only clocksources were used for sched_clock(), the
> > kernel could select the highest-rated clock under all circumstances.
> >
> > But that's quite intrusive, more of an idea. :-P
> 

Too intrusive I guess ;)

There were some discussions on this in the context of AM335x [1] [2].
Right now only sched_clock can be registered and I guess this restriction
is not going to go away any time soon.

Regards,
Vaibhav B.

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2012-January/080862.html 
[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2012-April/092890.html


> sched_clock is supposed to be very low overhead compared to ktime_get,
> and has some strict  requirements if CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
> is not set (see kernel/sched/clock.c), but it might be possible.
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>
Linus Walleij July 27, 2012, 10:23 p.m. UTC | #7
On Tue, Jul 24, 2012 at 11:16 AM, Bedia, Vaibhav <vaibhav.bedia@ti.com> wrote:

>> > A connecting theme is that of being avle to flag clock sources as
>> > sched_clock providers. If all clocksources were tagged with
>> > rating, and only clocksources were used for sched_clock(), the
>> > kernel could select the highest-rated clock under all circumstances.
>> >
>> > But that's quite intrusive, more of an idea. :-P
>>
>
> Too intrusive I guess ;)
>
> There were some discussions on this in the context of AM335x [1] [2].
> Right now only sched_clock can be registered and I guess this restriction
> is not going to go away any time soon.

Why do you think that? The restriction to only assign sched_clock() at
compile-time was recently removed so its now runtime assigned.

So yes, a clock source that can die and change frequency is no good
as primary system time, but the abstraction could still be used for
those that do, just add another flag NOT_CONTINUOUS or so, and
make sure the system does not select this for primary system
clock source.

Then modelling sched_clock() on clock sources makes all more
sense: just select the best one. For primary system clock source,
do not select one which is non-continous.

Yours,
Linus Walleij
Russell King - ARM Linux July 27, 2012, 10:41 p.m. UTC | #8
On Sat, Jul 28, 2012 at 12:23:05AM +0200, Linus Walleij wrote:
> On Tue, Jul 24, 2012 at 11:16 AM, Bedia, Vaibhav <vaibhav.bedia@ti.com> wrote:
> 
> >> > A connecting theme is that of being avle to flag clock sources as
> >> > sched_clock providers. If all clocksources were tagged with
> >> > rating, and only clocksources were used for sched_clock(), the
> >> > kernel could select the highest-rated clock under all circumstances.
> >> >
> >> > But that's quite intrusive, more of an idea. :-P
> >>
> >
> > Too intrusive I guess ;)
> >
> > There were some discussions on this in the context of AM335x [1] [2].
> > Right now only sched_clock can be registered and I guess this restriction
> > is not going to go away any time soon.
> 
> Why do you think that? The restriction to only assign sched_clock() at
> compile-time was recently removed so its now runtime assigned.
> 
> So yes, a clock source that can die and change frequency is no good
> as primary system time, but the abstraction could still be used for
> those that do, just add another flag NOT_CONTINUOUS or so, and
> make sure the system does not select this for primary system
> clock source.
> 
> Then modelling sched_clock() on clock sources makes all more
> sense: just select the best one. For primary system clock source,
> do not select one which is non-continous.

Except for the overhead.  sched_clock() is supposed to be as _fast_ as
it can possibly be, because it has a direct impact on the performance
of the system.
Vaibhav Bedia July 30, 2012, 12:31 p.m. UTC | #9
On Sat, Jul 28, 2012 at 03:53:05, Linus Walleij wrote:
> On Tue, Jul 24, 2012 at 11:16 AM, Bedia, Vaibhav <vaibhav.bedia@ti.com> wrote:
> 
> >> > A connecting theme is that of being avle to flag clock sources as
> >> > sched_clock providers. If all clocksources were tagged with
> >> > rating, and only clocksources were used for sched_clock(), the
> >> > kernel could select the highest-rated clock under all circumstances.
> >> >
> >> > But that's quite intrusive, more of an idea. :-P
> >>
> >
> > Too intrusive I guess ;)
> >
> > There were some discussions on this in the context of AM335x [1] [2].
> > Right now only sched_clock can be registered and I guess this restriction
> > is not going to go away any time soon.
> 
> Why do you think that? The restriction to only assign sched_clock() at
> compile-time was recently removed so its now runtime assigned.
> 

Yes it's runtime assigned but multiple calls to setup_sched_clock()
will trigger a WARN_ON() from arch/arm/kernel/sched_clock.c
unless I missing something basic here.

> So yes, a clock source that can die and change frequency is no good
> as primary system time, but the abstraction could still be used for
> those that do, just add another flag NOT_CONTINUOUS or so, and
> make sure the system does not select this for primary system
> clock source.
> 
> Then modelling sched_clock() on clock sources makes all more
> sense: just select the best one. For primary system clock source,
> do not select one which is non-continous.
> 

I think what you are suggesting above is similar to what AM335x was trying
to do but Russell pointed out issues with switching of timing sources and
that's what I was trying to highlight.

Regards,
Vaibhav B.
Kevin Hilman Oct. 19, 2012, 11:58 p.m. UTC | #10
Barry Song <21cnbao@gmail.com> writes:

> 2012/7/18 Colin Cross <ccross@android.com>:
>> Many clocks that are used to provide sched_clock will reset during
>> suspend.  If read_sched_clock returns 0 after suspend, sched_clock will
>> appear to jump forward.  This patch resets cd.epoch_cyc to the current
>> value of read_sched_clock during resume, which causes sched_clock() just
>> after suspend to return the same value as sched_clock() just before
>> suspend.
>>
>> In addition, during the window where epoch_ns has been updated before
>> suspend, but epoch_cyc has not been updated after suspend, it is unknown
>> whether the clock has reset or not, and sched_clock() could return a
>> bogus value.  Add a suspended flag, and return the pre-suspend epoch_ns
>> value during this period.
>
> Acked-by: Barry Song <21cnbao@gmail.com>
>
> this patch should also fix the issue that:
> 1. launch some rt threads, rt threads sleep before suspend
> 2. repeat to suspend/resume
> 3. after resuming, waking up rt threads
>
> repeat 1-3 again and again, sometimes all rt threads will hang after
> resuming due to wrong sched_clock will make sched_rt think rt_time is
> much more than rt_runtime (default 950ms in 1s). then rt threads will
> lost cpu timeslot to run since the 95% threshold is there.

Re-visiting this in light of a related problem.

I've run into a similar issue where IRQ threads are prevented from
running during resume becase the RT throttling kicks because RT
runtime is accumulated during suspend.  Using the 'needs_suspend'
version fixes this problem too.

However, because of the RT throttling issue, it seems like *all*
platforms should be using the 'needs_suspend' version always.  But, as
already pointed out, that makes the timed printk output during
suspend/resume rather unhelpful.

Having to choose between useful printk times during suspend/resume and
functioning IRQ threads during suspend/resume isn't a choice I want to
make.   I'd rather have both.  Any ideas?

Kevin
diff mbox

Patch

diff --git a/arch/arm/kernel/sched_clock.c b/arch/arm/kernel/sched_clock.c
index 27d186a..46c7d32 100644
--- a/arch/arm/kernel/sched_clock.c
+++ b/arch/arm/kernel/sched_clock.c
@@ -21,6 +21,7 @@  struct clock_data {
 	u32 epoch_cyc_copy;
 	u32 mult;
 	u32 shift;
+	bool suspended;
 };
 
 static void sched_clock_poll(unsigned long wrap_ticks);
@@ -49,6 +50,9 @@  static unsigned long long cyc_to_sched_clock(u32 cyc, u32 mask)
 	u64 epoch_ns;
 	u32 epoch_cyc;
 
+	if (cd.suspended)
+		return cd.epoch_ns;
+
 	/*
 	 * Load the epoch_cyc and epoch_ns atomically.  We do this by
 	 * ensuring that we always write epoch_cyc, epoch_ns and
@@ -169,11 +173,20 @@  void __init sched_clock_postinit(void)
 static int sched_clock_suspend(void)
 {
 	sched_clock_poll(sched_clock_timer.data);
+	cd.suspended = true;
 	return 0;
 }
 
+static void sched_clock_resume(void)
+{
+	cd.epoch_cyc = read_sched_clock();
+	cd.epoch_cyc_copy = cd.epoch_cyc;
+	cd.suspended = false;
+}
+
 static struct syscore_ops sched_clock_ops = {
 	.suspend = sched_clock_suspend,
+	.resume = sched_clock_resume,
 };
 
 static int __init sched_clock_syscore_init(void)