diff mbox

sched: Support current clocksource handling in fallback sched_clock().

Message ID 20090526061532.GD9188@linux-sh.org (mailing list archive)
State Superseded
Headers show

Commit Message

Paul Mundt May 26, 2009, 6:15 a.m. UTC
There are presently a number of issues and limitations with how the
clocksource and sched_clock() interaction works today. Configurations
tend to be grouped in to one of the following:

	- Platform provides a low rated clocksource (< 100) and prefers
	  to use jiffies for sched_clock() due to reliability concerns.

	- Platform provides its own clocksource and sched_clock() that
	  wraps in to it.

	- Platform uses a generic clocksource (ie, drivers/clocksource/)
	  combined with the generic jiffies-backed sched_clock().

	- Platform supports a generic highly-rated clocksource but ends
	  up having to use the jiffies sched_clock() anyways.

	- Platform supports multiple highly-rated clocksources.

In the first case, simply using the rating information is sufficient to
figure out the proper course of action. In the second case, very few of
these do anything outside of the regular cyc2ns() work on the preferred
clocksource, so it tends to be more about having access to the reference
clocksource data structures more than really wanting to do any special
calculations in sched_clock().

The last few cases are presently what we are faced with on sh, and which
also impacts other drivers/clocksource drivers (while acpi_pm seems to
have alternate recourse for sched_clock(), ARM/AVR32/SH do not). In these
cases multiple clocksources can be provided, and the availability of
these will often depend on runtime constraints (pinmux and so forth), in
which case link time determination is simply not sufficient. While these
clocksources can be highly rated and can offer excellent granularity, the
jiffies clocksource is still used as a fallback given the inability to
sprinkle sched_clock() wrappers in the drivers themselves. Also, while
sched_clock() could be moved in to struct clocksource itself, this does
not help the case where sched_clock() is called in to repeatedly well
before a preferred clocksource has been determined and made available
(printk times and so on), so extra logic is needed regardless.

This patch does the only thing I could think of to address most of
these in one shot, abusing the current clocksource pointer and forcing
sched_clock() to read from it directly as soon as it becomes available
(and assuming that is is rated highly enough). This does add the cost of
the rating test on systems that only have the jiffies clocksource, but I
think this is acceptable collateral damage given that jiffies are not
very granular to begin with.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>

---

 kernel/sched_clock.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Linus Walleij May 26, 2009, 2:31 p.m. UTC | #1
2009/5/26 Paul Mundt <lethal@linux-sh.org>:

>  */
>  unsigned long long __attribute__((weak)) sched_clock(void)
>  {
> +       /*
> +        * Use the current clocksource when it becomes available later in
> +        * the boot process, and ensure that it has a high enough rating
> +        * to make it suitable for general use.
> +        */
> +       if (clock && clock->rating >= 100)
> +               return cyc2ns(clock, clocksource_read(clock));
> +
> +       /* Otherwise just fall back on jiffies */
>        return (unsigned long long)(jiffies - INITIAL_JIFFIES)
>                                        * (NSEC_PER_SEC / HZ);
>  }

This seems like it would make the patch I sent the other day
unnecessary (subject u300 sched_clock() implementation).

It would also trim off this solution found in all OMAP platforms in
arch/arm/plat-omap/common.c

BUT Peter Zijlstra replied to my question about why this wasn't
generic with:

[peterz]:
> But that is the reason this isn't generic, non of the 'stable'
> clocksources on x86 are fast enough to use as sched_clock.

Does that mean clock->rating for these clocksources is
for certain < 100?

The definition of "rating" from the kerneldoc does not
seem to imply that, it's a subjective measure AFAICT.

Else you might want an additional criteria, like
cyc2ns(1) (much less than) jiffies_to_usecs(1)*1000
(however you do that the best way)
so you don't pick something
that isn't substantially faster than the jiffy counter atleast?

Yours,
Linus Walleij
--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra May 26, 2009, 2:38 p.m. UTC | #2
Added the generic clock and timer folks to CC.

On Tue, 2009-05-26 at 16:31 +0200, Linus Walleij wrote:
> 2009/5/26 Paul Mundt <lethal@linux-sh.org>:
> 
> >  */
> >  unsigned long long __attribute__((weak)) sched_clock(void)
> >  {
> > +       /*
> > +        * Use the current clocksource when it becomes available later in
> > +        * the boot process, and ensure that it has a high enough rating
> > +        * to make it suitable for general use.
> > +        */
> > +       if (clock && clock->rating >= 100)
> > +               return cyc2ns(clock, clocksource_read(clock));
> > +
> > +       /* Otherwise just fall back on jiffies */
> >        return (unsigned long long)(jiffies - INITIAL_JIFFIES)
> >                                        * (NSEC_PER_SEC / HZ);
> >  }
> 
> This seems like it would make the patch I sent the other day
> unnecessary (subject u300 sched_clock() implementation).
> 
> It would also trim off this solution found in all OMAP platforms in
> arch/arm/plat-omap/common.c
> 
> BUT Peter Zijlstra replied to my question about why this wasn't
> generic with:
> 
> [peterz]:
> > But that is the reason this isn't generic, non of the 'stable'
> > clocksources on x86 are fast enough to use as sched_clock.
> 
> Does that mean clock->rating for these clocksources is
> for certain < 100?
> 
> The definition of "rating" from the kerneldoc does not
> seem to imply that, it's a subjective measure AFAICT.
> 
> Else you might want an additional criteria, like
> cyc2ns(1) (much less than) jiffies_to_usecs(1)*1000
> (however you do that the best way)
> so you don't pick something
> that isn't substantially faster than the jiffy counter atleast?




--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul Mundt May 26, 2009, 2:43 p.m. UTC | #3
On Tue, May 26, 2009 at 04:31:58PM +0200, Linus Walleij wrote:
> 2009/5/26 Paul Mundt <lethal@linux-sh.org>:
> 
> >  */
> >  unsigned long long __attribute__((weak)) sched_clock(void)
> >  {
> > +       /*
> > +        * Use the current clocksource when it becomes available later in
> > +        * the boot process, and ensure that it has a high enough rating
> > +        * to make it suitable for general use.
> > +        */
> > +       if (clock && clock->rating >= 100)
> > +               return cyc2ns(clock, clocksource_read(clock));
> > +
> > +       /* Otherwise just fall back on jiffies */
> >        return (unsigned long long)(jiffies - INITIAL_JIFFIES)
> >                                        * (NSEC_PER_SEC / HZ);
> >  }
> 
> This seems like it would make the patch I sent the other day
> unnecessary (subject u300 sched_clock() implementation).
> 
> It would also trim off this solution found in all OMAP platforms in
> arch/arm/plat-omap/common.c
> 
> BUT Peter Zijlstra replied to my question about why this wasn't
> generic with:
> 
Hum, you trimmed out my changelog which explains all the rationale for
precisely why this needs to be generic. Hopefully people that care will
go back and read that.

> [peterz]:
> > But that is the reason this isn't generic, non of the 'stable'
> > clocksources on x86 are fast enough to use as sched_clock.
> 
> Does that mean clock->rating for these clocksources is
> for certain < 100?
> 
The '100' thing is a bit arbitrary, this is what defines base level
usability. If we want to set a mandate that sched_clock() sources need to
start at 300 or 400 or whatever, that is fine with me, too. In the case
of x86 there are several < 100 ratings, but I don't know if those cover
all of the cases Peter is concerned about.

Regardless, the above cyc2ns() logic does make most of the
architecture-specific sched_clock() implementations redundant, so they
can of course be killed off incrementally. This might not be the case for
x86, but in those cases I expect a different sched_clock() to be
implemented anyways.

> Else you might want an additional criteria, like
> cyc2ns(1) (much less than) jiffies_to_usecs(1)*1000
> (however you do that the best way)
> so you don't pick something
> that isn't substantially faster than the jiffy counter atleast?
> 
This rather defeats the purpose of sched_clock() being fast. If we want
to add a flag that means this in to the clocksource instead of consulting
the rating, that is fine with me too. I know which clocksources I prefer
to use for a sched_clock() and they are all better than jiffies. The
semantics of how we tell sched_clock() that are not so important. Rating
seemed like a good choice from the documentation in struct clocksource at
least.
--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra May 26, 2009, 2:50 p.m. UTC | #4
On Tue, 2009-05-26 at 23:43 +0900, Paul Mundt wrote:
> > Else you might want an additional criteria, like
> > cyc2ns(1) (much less than) jiffies_to_usecs(1)*1000
> > (however you do that the best way)
> > so you don't pick something
> > that isn't substantially faster than the jiffy counter atleast?
> > 
> This rather defeats the purpose of sched_clock() being fast. If we want
> to add a flag that means this in to the clocksource instead of consulting
> the rating, that is fine with me too. I know which clocksources I prefer
> to use for a sched_clock() and they are all better than jiffies. The
> semantics of how we tell sched_clock() that are not so important. Rating
> seemed like a good choice from the documentation in struct clocksource at
> least.

Am I confused or are we talking about fast HZ vs fast cycles?

sched_clock() should be fast cycles, that is, we don't want to read a
clock that takes about 1000 cycles.

sched_clock() is about providing a high resolution clock that is fast
(low cycle count) to acquire, and need not be strictly monotonic on smp.
--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul Mundt May 26, 2009, 2:53 p.m. UTC | #5
On Tue, May 26, 2009 at 04:50:09PM +0200, Peter Zijlstra wrote:
> On Tue, 2009-05-26 at 23:43 +0900, Paul Mundt wrote:
> > > Else you might want an additional criteria, like
> > > cyc2ns(1) (much less than) jiffies_to_usecs(1)*1000
> > > (however you do that the best way)
> > > so you don't pick something
> > > that isn't substantially faster than the jiffy counter atleast?
> > > 
> > This rather defeats the purpose of sched_clock() being fast. If we want
> > to add a flag that means this in to the clocksource instead of consulting
> > the rating, that is fine with me too. I know which clocksources I prefer
> > to use for a sched_clock() and they are all better than jiffies. The
> > semantics of how we tell sched_clock() that are not so important. Rating
> > seemed like a good choice from the documentation in struct clocksource at
> > least.
> 
> Am I confused or are we talking about fast HZ vs fast cycles?
> 
> sched_clock() should be fast cycles, that is, we don't want to read a
> clock that takes about 1000 cycles.
> 
> sched_clock() is about providing a high resolution clock that is fast
> (low cycle count) to acquire, and need not be strictly monotonic on smp.

I don't think there's any confusion here. My point is that I didn't want
to add too much logic in to sched_clock() given that it is supposed to be
fast. So if the rating test by itself is not sufficient, then we need
another way to flag a clocksource as being usable for sched_clock().
--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Matthieu CASTET May 26, 2009, 3:02 p.m. UTC | #6
Linus Walleij a écrit :
> 2009/5/26 Paul Mundt <lethal@linux-sh.org>:
> 
>>  */
>>  unsigned long long __attribute__((weak)) sched_clock(void)
>>  {
>> +       /*
>> +        * Use the current clocksource when it becomes available later in
>> +        * the boot process, and ensure that it has a high enough rating
>> +        * to make it suitable for general use.
>> +        */
>> +       if (clock && clock->rating >= 100)
>> +               return cyc2ns(clock, clocksource_read(clock));
>> +
>> +       /* Otherwise just fall back on jiffies */
>>        return (unsigned long long)(jiffies - INITIAL_JIFFIES)
>>                                        * (NSEC_PER_SEC / HZ);
>>  }
This is buggy for fast clocksource that wrap often, because sched_clock
is not supposed to wrap.
That why custom implementation use cnt32_to_63 and a smaller shift
(around 8 -10) for conversion for timer unit to ns.

Look for example to arch/arm/mach-pxa/time.c implementation [1]

But we can image you make it generic.

Matthieu

[1]
/*
 * This is PXA's sched_clock implementation. This has a resolution
 * of at least 308 ns and a maximum value of 208 days.
 *
 * The return value is guaranteed to be monotonic in that range as
 * long as there is always less than 582 seconds between successive
 * calls to sched_clock() which should always be the case in practice.
 */
#define OSCR2NS_SCALE_FACTOR 10

static unsigned long oscr2ns_scale;

static void __init set_oscr2ns_scale(unsigned long oscr_rate)
{
    unsigned long long v = 1000000000ULL << OSCR2NS_SCALE_FACTOR;
    do_div(v, oscr_rate);
    oscr2ns_scale = v;
    /*
     * We want an even value to automatically clear the top bit
     * returned by cnt32_to_63() without an additional run time
     * instruction. So if the LSB is 1 then round it up.
     */
    if (oscr2ns_scale & 1)
        oscr2ns_scale++;
}

unsigned long long sched_clock(void)
{
    unsigned long long v = cnt32_to_63(OSCR);
    return (v * oscr2ns_scale) >> OSCR2NS_SCALE_FACTOR;
}
--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Gleixner May 26, 2009, 8:17 p.m. UTC | #7
On Tue, 26 May 2009, Peter Zijlstra wrote:
> Added the generic clock and timer folks to CC.
> 
> On Tue, 2009-05-26 at 16:31 +0200, Linus Walleij wrote:
> > 2009/5/26 Paul Mundt <lethal@linux-sh.org>:
> > 
> > >  */
> > >  unsigned long long __attribute__((weak)) sched_clock(void)
> > >  {
> > > +       /*
> > > +        * Use the current clocksource when it becomes available later in
> > > +        * the boot process, and ensure that it has a high enough rating
> > > +        * to make it suitable for general use.
> > > +        */
> > > +       if (clock && clock->rating >= 100)
> > > +               return cyc2ns(clock, clocksource_read(clock));
> > > +
> > > +       /* Otherwise just fall back on jiffies */
> > >        return (unsigned long long)(jiffies - INITIAL_JIFFIES)
> > >                                        * (NSEC_PER_SEC / HZ);
> > >  }
> > 
> > This seems like it would make the patch I sent the other day
> > unnecessary (subject u300 sched_clock() implementation).
> > 
> > It would also trim off this solution found in all OMAP platforms in
> > arch/arm/plat-omap/common.c
> > 
> > BUT Peter Zijlstra replied to my question about why this wasn't
> > generic with:
> > 
> > [peterz]:
> > > But that is the reason this isn't generic, non of the 'stable'
> > > clocksources on x86 are fast enough to use as sched_clock.
> > 
> > Does that mean clock->rating for these clocksources is
> > for certain < 100?
> > 
> > The definition of "rating" from the kerneldoc does not
> > seem to imply that, it's a subjective measure AFAICT.

  Right, there is no rating threshold defined, which allows to deduce
  that. The TSC on x86 which might be unreliable, but usable as
  sched_clock has an initial rating of 300 which can be changed later
  on to 0 when the TSC is unusable as a time of day source. In that
  case clock is replaced by HPET which has a rating > 100 but is
  definitely not a good choice for sched_clock

> > Else you might want an additional criteria, like
> > cyc2ns(1) (much less than) jiffies_to_usecs(1)*1000
> > (however you do that the best way)
> > so you don't pick something
> > that isn't substantially faster than the jiffy counter atleast?

  What we can do is add another flag to the clocksource e.g.
  CLOCK_SOURCE_USE_FOR_SCHED_CLOCK and check this instead of the
  rating.

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
John Stultz May 26, 2009, 8:23 p.m. UTC | #8
On Tue, 2009-05-26 at 16:38 +0200, Peter Zijlstra wrote:
> Added the generic clock and timer folks to CC.
> 
> On Tue, 2009-05-26 at 16:31 +0200, Linus Walleij wrote:
> > 2009/5/26 Paul Mundt <lethal@linux-sh.org>:
> > 
> > >  */
> > >  unsigned long long __attribute__((weak)) sched_clock(void)
> > >  {
> > > +       /*
> > > +        * Use the current clocksource when it becomes available later in
> > > +        * the boot process, and ensure that it has a high enough rating
> > > +        * to make it suitable for general use.
> > > +        */
> > > +       if (clock && clock->rating >= 100)
> > > +               return cyc2ns(clock, clocksource_read(clock));

I'm not super familiar with the recent sched_clock changes, but how will
this work if the clocksource wraps (ACPI PM wraps every 2-5 seconds).

Also there's no locking here, so the clocksource could change under you.

Further, checking for rating being greater then 100 really doesn't mean
anything. Probably need to check if the clocksource is continuous
instead.

Overall, I'd probably suggest thinking this through a bit more. At some
point doing this right will cause sched_clock() to be basically the same
as ktime_get(). So why not just use that instead of remaking it?

thanks
-john


--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra May 26, 2009, 8:30 p.m. UTC | #9
On Tue, 2009-05-26 at 13:23 -0700, john stultz wrote:
> Overall, I'd probably suggest thinking this through a bit more. At some
> point doing this right will cause sched_clock() to be basically the same
> as ktime_get(). So why not just use that instead of remaking it?

simply because we don't require the strict global monotonicy for
scheduling as we do from a regular time source (its nice to have
though).

That means that on x86 we can always use TSC for sched_clock(), even
when its quite unsuitable for ktime.



--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Gleixner May 26, 2009, 8:39 p.m. UTC | #10
On Tue, 26 May 2009, john stultz wrote:

> On Tue, 2009-05-26 at 16:38 +0200, Peter Zijlstra wrote:
> > Added the generic clock and timer folks to CC.
> > 
> > On Tue, 2009-05-26 at 16:31 +0200, Linus Walleij wrote:
> > > 2009/5/26 Paul Mundt <lethal@linux-sh.org>:
> > > 
> > > >  */
> > > >  unsigned long long __attribute__((weak)) sched_clock(void)
> > > >  {
> > > > +       /*
> > > > +        * Use the current clocksource when it becomes available later in
> > > > +        * the boot process, and ensure that it has a high enough rating
> > > > +        * to make it suitable for general use.
> > > > +        */
> > > > +       if (clock && clock->rating >= 100)
> > > > +               return cyc2ns(clock, clocksource_read(clock));
> 
> I'm not super familiar with the recent sched_clock changes, but how will
> this work if the clocksource wraps (ACPI PM wraps every 2-5 seconds).

You don't want to use ACPI PM for sched_clock, never ever.
 
> Also there's no locking here, so the clocksource could change under you.
> 
> Further, checking for rating being greater then 100 really doesn't mean
> anything. Probably need to check if the clocksource is continuous
> instead.

I'd like to have an explicit flag for this, so we can avoid that stuff
like pmtimer and other slow access clock sources are used.
 
> Overall, I'd probably suggest thinking this through a bit more. At some
> point doing this right will cause sched_clock() to be basically the same
> as ktime_get(). So why not just use that instead of remaking it?

ktime_get() involves xtime lock and the scheduler does not care about
a slightly wrong value. sched_clock does not have the accuracy
requirements of time keeping,

If we have an explicit flag we can replace lots of arch/embedded
sched_clock implementations with a generic one which is a Good Thing.

There is a world beside the broken x86 timers :)

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
John Stultz May 26, 2009, 8:40 p.m. UTC | #11
On Tue, 2009-05-26 at 22:30 +0200, Peter Zijlstra wrote:
> On Tue, 2009-05-26 at 13:23 -0700, john stultz wrote:
> > Overall, I'd probably suggest thinking this through a bit more. At some
> > point doing this right will cause sched_clock() to be basically the same
> > as ktime_get(). So why not just use that instead of remaking it?
> 
> simply because we don't require the strict global monotonicy for
> scheduling as we do from a regular time source (its nice to have
> though).
> 
> That means that on x86 we can always use TSC for sched_clock(), even
> when its quite unsuitable for ktime.

Right, but I guess what I'm asking is can this be a bit better defined? 

If we are going to use clocksources (or cyclecounters - an area I need
to clean up soon), it would be good to get an idea of what is expected
of the sched_clock() interface.

So TSC good, HPET bad. Why? Is latency all we care about? How bad would
the TSC have to be before we wouldn't want to use it?

thanks
-john


--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra May 26, 2009, 8:55 p.m. UTC | #12
On Tue, 2009-05-26 at 13:40 -0700, john stultz wrote:
> On Tue, 2009-05-26 at 22:30 +0200, Peter Zijlstra wrote:
> > On Tue, 2009-05-26 at 13:23 -0700, john stultz wrote:
> > > Overall, I'd probably suggest thinking this through a bit more. At some
> > > point doing this right will cause sched_clock() to be basically the same
> > > as ktime_get(). So why not just use that instead of remaking it?
> > 
> > simply because we don't require the strict global monotonicy for
> > scheduling as we do from a regular time source (its nice to have
> > though).
> > 
> > That means that on x86 we can always use TSC for sched_clock(), even
> > when its quite unsuitable for ktime.
> 
> Right, but I guess what I'm asking is can this be a bit better defined? 
> 
> If we are going to use clocksources (or cyclecounters - an area I need
> to clean up soon), it would be good to get an idea of what is expected
> of the sched_clock() interface.
> 
> So TSC good, HPET bad. Why? 

Because TSC is a few cycles to read, and you can factorize a largish
prime while doing an HPET read :-)

> Is latency all we care about? How bad would
> the TSC have to be before we wouldn't want to use it?

Anything better than jiffies ;-)

For sched_clock() we want something high-res that is monotonic per cpu
and has a bounded drift between cpus in the order of jiffies.

Look at kernel/sched_clock.c for what we do to make really shitty TSC
conform to the above requirements.

--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
John Stultz May 26, 2009, 11 p.m. UTC | #13
On Tue, 2009-05-26 at 22:55 +0200, Peter Zijlstra wrote:
> On Tue, 2009-05-26 at 13:40 -0700, john stultz wrote:
> > On Tue, 2009-05-26 at 22:30 +0200, Peter Zijlstra wrote:
> > > On Tue, 2009-05-26 at 13:23 -0700, john stultz wrote:
> > > > Overall, I'd probably suggest thinking this through a bit more. At some
> > > > point doing this right will cause sched_clock() to be basically the same
> > > > as ktime_get(). So why not just use that instead of remaking it?
> > > 
> > > simply because we don't require the strict global monotonicy for
> > > scheduling as we do from a regular time source (its nice to have
> > > though).
> > > 
> > > That means that on x86 we can always use TSC for sched_clock(), even
> > > when its quite unsuitable for ktime.
> > 
> > Right, but I guess what I'm asking is can this be a bit better defined? 
> > 
> > If we are going to use clocksources (or cyclecounters - an area I need
> > to clean up soon), it would be good to get an idea of what is expected
> > of the sched_clock() interface.
> > 
> > So TSC good, HPET bad. Why? 
> 
> Because TSC is a few cycles to read, and you can factorize a largish
> prime while doing an HPET read :-)
> 
> > Is latency all we care about? How bad would
> > the TSC have to be before we wouldn't want to use it?
> 
> Anything better than jiffies ;-)

Except HPET thought, right? :)

> For sched_clock() we want something high-res that is monotonic per cpu
> and has a bounded drift between cpus in the order of jiffies.
> 
> Look at kernel/sched_clock.c for what we do to make really shitty TSC
> conform to the above requirements.

Sure, I guess what I'm trying to pull out here is that should we try to
create some OK_FOR_SCHED_CLOCK flag for clocksources, and then we try to
make this generic so other arches can add that flag and be done, what is
the guidance we want to give to arch maintainers for setting that flag?

1) Has to be very very fast. Can we put a number on this? 50ns to read?

2) How long does it have to be monotonic for? Is it ok if it wraps every
few seconds?

If get_cycles() || jiffies is what we want, then lets leave it there. I
just want to avoid mixing the clocksource code into the sched clock code
until we really get this sort of definition sorted.

thanks
-john


--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mangalampalli, JayantX May 26, 2009, 11:24 p.m. UTC | #14
Isn't rdtsc inherently dependent on CPU tick rate, which is somewhat variable? Shouldn't HPET be more reliable? Or can constant_tsc be relied on for important things like scheduler?

Thanks
Jayant

-----Original Message-----
From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of john stultz
Sent: Tuesday, May 26, 2009 4:01 PM
To: Peter Zijlstra
Cc: Linus Walleij; Paul Mundt; Ingo Molnar; Andrew Victor; Haavard Skinnemoen; Andrew Morton; linux-kernel@vger.kernel.org; linux-sh@vger.kernel.org; linux-arm-kernel@lists.arm.linux.org.uk; John Stultz; Thomas Gleixner
Subject: Re: [PATCH] sched: Support current clocksource handling in fallback sched_clock().

On Tue, 2009-05-26 at 22:55 +0200, Peter Zijlstra wrote:
> On Tue, 2009-05-26 at 13:40 -0700, john stultz wrote:
> > On Tue, 2009-05-26 at 22:30 +0200, Peter Zijlstra wrote:
> > > On Tue, 2009-05-26 at 13:23 -0700, john stultz wrote:
> > > > Overall, I'd probably suggest thinking this through a bit more. At some
> > > > point doing this right will cause sched_clock() to be basically the same
> > > > as ktime_get(). So why not just use that instead of remaking it?
> > > 
> > > simply because we don't require the strict global monotonicy for
> > > scheduling as we do from a regular time source (its nice to have
> > > though).
> > > 
> > > That means that on x86 we can always use TSC for sched_clock(), even
> > > when its quite unsuitable for ktime.
> > 
> > Right, but I guess what I'm asking is can this be a bit better defined? 
> > 
> > If we are going to use clocksources (or cyclecounters - an area I need
> > to clean up soon), it would be good to get an idea of what is expected
> > of the sched_clock() interface.
> > 
> > So TSC good, HPET bad. Why? 
> 
> Because TSC is a few cycles to read, and you can factorize a largish
> prime while doing an HPET read :-)
> 
> > Is latency all we care about? How bad would
> > the TSC have to be before we wouldn't want to use it?
> 
> Anything better than jiffies ;-)

Except HPET thought, right? :)

> For sched_clock() we want something high-res that is monotonic per cpu
> and has a bounded drift between cpus in the order of jiffies.
> 
> Look at kernel/sched_clock.c for what we do to make really shitty TSC
> conform to the above requirements.

Sure, I guess what I'm trying to pull out here is that should we try to
create some OK_FOR_SCHED_CLOCK flag for clocksources, and then we try to
make this generic so other arches can add that flag and be done, what is
the guidance we want to give to arch maintainers for setting that flag?

1) Has to be very very fast. Can we put a number on this? 50ns to read?

2) How long does it have to be monotonic for? Is it ok if it wraps every
few seconds?

If get_cycles() || jiffies is what we want, then lets leave it there. I
just want to avoid mixing the clocksource code into the sched clock code
until we really get this sort of definition sorted.

thanks
-john


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Gleixner May 26, 2009, 11:39 p.m. UTC | #15
On Tue, 26 May 2009, john stultz wrote:
w> > > Is latency all we care about? How bad would
> > > the TSC have to be before we wouldn't want to use it?
> > 
> > Anything better than jiffies ;-)
> 
> Except HPET thought, right? :)

Eeek. HPET access serializes across cores. That's orders of magnitudes
worse than the granularity loss of jiffies.

> > For sched_clock() we want something high-res that is monotonic per cpu
> > and has a bounded drift between cpus in the order of jiffies.
> > 
> > Look at kernel/sched_clock.c for what we do to make really shitty TSC
> > conform to the above requirements.
> 
> Sure, I guess what I'm trying to pull out here is that should we try to
> create some OK_FOR_SCHED_CLOCK flag for clocksources, and then we try to
> make this generic so other arches can add that flag and be done, what is
> the guidance we want to give to arch maintainers for setting that flag?
> 
> 1) Has to be very very fast. Can we put a number on this? 50ns to read?
> 
> 2) How long does it have to be monotonic for? Is it ok if it wraps every
> few seconds?
> 
> If get_cycles() || jiffies is what we want, then lets leave it there. I
> just want to avoid mixing the clocksource code into the sched clock code
> until we really get this sort of definition sorted.

The criterion is simple. If arch maintainer decides that the access to
the particular clock source is the best compromise between granularity
and access speed vs. jiffies then he can express that by setting the
flag.

Where is the difference between that flag and an arch specific
sched_clock() implementation which overrides the weak generic one ?

That arch specific function will probably do the same thing:

     return cyc2ns(whathever_clocksource_the_maintainer_thinks_is_the_best);

So let's get rid of those sched_clock() overrrides which do nothing
else than the generic version.

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Gleixner May 27, 2009, 12:04 a.m. UTC | #16
On Tue, 26 May 2009, Mangalampalli, JayantX wrote:

1) please do not top post.
2) please fix your mail client to do proper line breaks at ~78 chars

(see http://www.tux.org/lkml/)

> Isn't rdtsc inherently dependent on CPU tick rate, which is somewhat
> variable? Shouldn't HPET be more reliable? Or can constant_tsc be
> relied on for important things like scheduler?

1) Please understand that there is an universe which has useful timer
implementations. i.e. almost everything except arch/x86

2) rdtsc is not inherently dependent on CPU frequency. It just depends
on the CPU frequency for most of the CPU implementations which have
the ability to change the CPU core frequency. Newer CPUs have core
frequency invariant TSC implementations, but they are not perfect yet
as the TSC stops in deeper C-states.

3) HPET is reliable but the hardware access to HPET is way more
expensive than the access to TSC. Also it does not scale on SMP
machines as the HPET access is serialized and worse than a global
lock. So there is a damned good reason why we went there to add the
horror of sched_clock.c to utilize TSC for hot code pathes.

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra May 27, 2009, 6:58 a.m. UTC | #17
On Tue, 2009-05-26 at 16:00 -0700, john stultz wrote:

> Sure, I guess what I'm trying to pull out here is that should we try to
> create some OK_FOR_SCHED_CLOCK flag for clocksources, and then we try to
> make this generic so other arches can add that flag and be done, what is
> the guidance we want to give to arch maintainers for setting that flag?
> 
> 1) Has to be very very fast. Can we put a number on this? 50ns to read?

I'd express it in cpu cycles, but not sure, the very fastest the arch
has :-) Just tell people this is in their scheduling hot-path and they
might understand.

> 2) How long does it have to be monotonic for? 

Forever? (per cpu)

> Is it ok if it wraps every few seconds?

No, if it wraps it needs to wrap on u64.

--
To unsubscribe from this list: send the line "unsubscribe linux-sh" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
index e1d16c9..59bbeeb 100644
--- a/kernel/sched_clock.c
+++ b/kernel/sched_clock.c
@@ -30,6 +30,7 @@ 
 #include <linux/percpu.h>
 #include <linux/ktime.h>
 #include <linux/sched.h>
+#include <linux/clocksource.h>
 
 /*
  * Scheduler clock - returns current time in nanosec units.
@@ -38,6 +39,15 @@ 
  */
 unsigned long long __attribute__((weak)) sched_clock(void)
 {
+	/*
+	 * Use the current clocksource when it becomes available later in
+	 * the boot process, and ensure that it has a high enough rating
+	 * to make it suitable for general use.
+	 */
+	if (clock && clock->rating >= 100)
+		return cyc2ns(clock, clocksource_read(clock));
+
+	/* Otherwise just fall back on jiffies */
 	return (unsigned long long)(jiffies - INITIAL_JIFFIES)
 					* (NSEC_PER_SEC / HZ);
 }