Message ID | 20090526061532.GD9188@linux-sh.org (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
2009/5/26 Paul Mundt <lethal@linux-sh.org>: > */ > unsigned long long __attribute__((weak)) sched_clock(void) > { > + /* > + * Use the current clocksource when it becomes available later in > + * the boot process, and ensure that it has a high enough rating > + * to make it suitable for general use. > + */ > + if (clock && clock->rating >= 100) > + return cyc2ns(clock, clocksource_read(clock)); > + > + /* Otherwise just fall back on jiffies */ > return (unsigned long long)(jiffies - INITIAL_JIFFIES) > * (NSEC_PER_SEC / HZ); > } This seems like it would make the patch I sent the other day unnecessary (subject u300 sched_clock() implementation). It would also trim off this solution found in all OMAP platforms in arch/arm/plat-omap/common.c BUT Peter Zijlstra replied to my question about why this wasn't generic with: [peterz]: > But that is the reason this isn't generic, non of the 'stable' > clocksources on x86 are fast enough to use as sched_clock. Does that mean clock->rating for these clocksources is for certain < 100? The definition of "rating" from the kerneldoc does not seem to imply that, it's a subjective measure AFAICT. Else you might want an additional criteria, like cyc2ns(1) (much less than) jiffies_to_usecs(1)*1000 (however you do that the best way) so you don't pick something that isn't substantially faster than the jiffy counter atleast? Yours, Linus Walleij -- To unsubscribe from this list: send the line "unsubscribe linux-sh" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Added the generic clock and timer folks to CC. On Tue, 2009-05-26 at 16:31 +0200, Linus Walleij wrote: > 2009/5/26 Paul Mundt <lethal@linux-sh.org>: > > > */ > > unsigned long long __attribute__((weak)) sched_clock(void) > > { > > + /* > > + * Use the current clocksource when it becomes available later in > > + * the boot process, and ensure that it has a high enough rating > > + * to make it suitable for general use. > > + */ > > + if (clock && clock->rating >= 100) > > + return cyc2ns(clock, clocksource_read(clock)); > > + > > + /* Otherwise just fall back on jiffies */ > > return (unsigned long long)(jiffies - INITIAL_JIFFIES) > > * (NSEC_PER_SEC / HZ); > > } > > This seems like it would make the patch I sent the other day > unnecessary (subject u300 sched_clock() implementation). > > It would also trim off this solution found in all OMAP platforms in > arch/arm/plat-omap/common.c > > BUT Peter Zijlstra replied to my question about why this wasn't > generic with: > > [peterz]: > > But that is the reason this isn't generic, non of the 'stable' > > clocksources on x86 are fast enough to use as sched_clock. > > Does that mean clock->rating for these clocksources is > for certain < 100? > > The definition of "rating" from the kerneldoc does not > seem to imply that, it's a subjective measure AFAICT. > > Else you might want an additional criteria, like > cyc2ns(1) (much less than) jiffies_to_usecs(1)*1000 > (however you do that the best way) > so you don't pick something > that isn't substantially faster than the jiffy counter atleast? -- To unsubscribe from this list: send the line "unsubscribe linux-sh" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 26, 2009 at 04:31:58PM +0200, Linus Walleij wrote: > 2009/5/26 Paul Mundt <lethal@linux-sh.org>: > > > */ > > unsigned long long __attribute__((weak)) sched_clock(void) > > { > > + /* > > + * Use the current clocksource when it becomes available later in > > + * the boot process, and ensure that it has a high enough rating > > + * to make it suitable for general use. > > + */ > > + if (clock && clock->rating >= 100) > > + return cyc2ns(clock, clocksource_read(clock)); > > + > > + /* Otherwise just fall back on jiffies */ > > return (unsigned long long)(jiffies - INITIAL_JIFFIES) > > * (NSEC_PER_SEC / HZ); > > } > > This seems like it would make the patch I sent the other day > unnecessary (subject u300 sched_clock() implementation). > > It would also trim off this solution found in all OMAP platforms in > arch/arm/plat-omap/common.c > > BUT Peter Zijlstra replied to my question about why this wasn't > generic with: > Hum, you trimmed out my changelog which explains all the rationale for precisely why this needs to be generic. Hopefully people that care will go back and read that. > [peterz]: > > But that is the reason this isn't generic, non of the 'stable' > > clocksources on x86 are fast enough to use as sched_clock. > > Does that mean clock->rating for these clocksources is > for certain < 100? > The '100' thing is a bit arbitrary, this is what defines base level usability. If we want to set a mandate that sched_clock() sources need to start at 300 or 400 or whatever, that is fine with me, too. In the case of x86 there are several < 100 ratings, but I don't know if those cover all of the cases Peter is concerned about. Regardless, the above cyc2ns() logic does make most of the architecture-specific sched_clock() implementations redundant, so they can of course be killed off incrementally. This might not be the case for x86, but in those cases I expect a different sched_clock() to be implemented anyways. > Else you might want an additional criteria, like > cyc2ns(1) (much less than) jiffies_to_usecs(1)*1000 > (however you do that the best way) > so you don't pick something > that isn't substantially faster than the jiffy counter atleast? > This rather defeats the purpose of sched_clock() being fast. If we want to add a flag that means this in to the clocksource instead of consulting the rating, that is fine with me too. I know which clocksources I prefer to use for a sched_clock() and they are all better than jiffies. The semantics of how we tell sched_clock() that are not so important. Rating seemed like a good choice from the documentation in struct clocksource at least. -- To unsubscribe from this list: send the line "unsubscribe linux-sh" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2009-05-26 at 23:43 +0900, Paul Mundt wrote: > > Else you might want an additional criteria, like > > cyc2ns(1) (much less than) jiffies_to_usecs(1)*1000 > > (however you do that the best way) > > so you don't pick something > > that isn't substantially faster than the jiffy counter atleast? > > > This rather defeats the purpose of sched_clock() being fast. If we want > to add a flag that means this in to the clocksource instead of consulting > the rating, that is fine with me too. I know which clocksources I prefer > to use for a sched_clock() and they are all better than jiffies. The > semantics of how we tell sched_clock() that are not so important. Rating > seemed like a good choice from the documentation in struct clocksource at > least. Am I confused or are we talking about fast HZ vs fast cycles? sched_clock() should be fast cycles, that is, we don't want to read a clock that takes about 1000 cycles. sched_clock() is about providing a high resolution clock that is fast (low cycle count) to acquire, and need not be strictly monotonic on smp. -- To unsubscribe from this list: send the line "unsubscribe linux-sh" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 26, 2009 at 04:50:09PM +0200, Peter Zijlstra wrote: > On Tue, 2009-05-26 at 23:43 +0900, Paul Mundt wrote: > > > Else you might want an additional criteria, like > > > cyc2ns(1) (much less than) jiffies_to_usecs(1)*1000 > > > (however you do that the best way) > > > so you don't pick something > > > that isn't substantially faster than the jiffy counter atleast? > > > > > This rather defeats the purpose of sched_clock() being fast. If we want > > to add a flag that means this in to the clocksource instead of consulting > > the rating, that is fine with me too. I know which clocksources I prefer > > to use for a sched_clock() and they are all better than jiffies. The > > semantics of how we tell sched_clock() that are not so important. Rating > > seemed like a good choice from the documentation in struct clocksource at > > least. > > Am I confused or are we talking about fast HZ vs fast cycles? > > sched_clock() should be fast cycles, that is, we don't want to read a > clock that takes about 1000 cycles. > > sched_clock() is about providing a high resolution clock that is fast > (low cycle count) to acquire, and need not be strictly monotonic on smp. I don't think there's any confusion here. My point is that I didn't want to add too much logic in to sched_clock() given that it is supposed to be fast. So if the rating test by itself is not sufficient, then we need another way to flag a clocksource as being usable for sched_clock(). -- To unsubscribe from this list: send the line "unsubscribe linux-sh" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Linus Walleij a écrit : > 2009/5/26 Paul Mundt <lethal@linux-sh.org>: > >> */ >> unsigned long long __attribute__((weak)) sched_clock(void) >> { >> + /* >> + * Use the current clocksource when it becomes available later in >> + * the boot process, and ensure that it has a high enough rating >> + * to make it suitable for general use. >> + */ >> + if (clock && clock->rating >= 100) >> + return cyc2ns(clock, clocksource_read(clock)); >> + >> + /* Otherwise just fall back on jiffies */ >> return (unsigned long long)(jiffies - INITIAL_JIFFIES) >> * (NSEC_PER_SEC / HZ); >> } This is buggy for fast clocksource that wrap often, because sched_clock is not supposed to wrap. That why custom implementation use cnt32_to_63 and a smaller shift (around 8 -10) for conversion for timer unit to ns. Look for example to arch/arm/mach-pxa/time.c implementation [1] But we can image you make it generic. Matthieu [1] /* * This is PXA's sched_clock implementation. This has a resolution * of at least 308 ns and a maximum value of 208 days. * * The return value is guaranteed to be monotonic in that range as * long as there is always less than 582 seconds between successive * calls to sched_clock() which should always be the case in practice. */ #define OSCR2NS_SCALE_FACTOR 10 static unsigned long oscr2ns_scale; static void __init set_oscr2ns_scale(unsigned long oscr_rate) { unsigned long long v = 1000000000ULL << OSCR2NS_SCALE_FACTOR; do_div(v, oscr_rate); oscr2ns_scale = v; /* * We want an even value to automatically clear the top bit * returned by cnt32_to_63() without an additional run time * instruction. So if the LSB is 1 then round it up. */ if (oscr2ns_scale & 1) oscr2ns_scale++; } unsigned long long sched_clock(void) { unsigned long long v = cnt32_to_63(OSCR); return (v * oscr2ns_scale) >> OSCR2NS_SCALE_FACTOR; } -- To unsubscribe from this list: send the line "unsubscribe linux-sh" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 26 May 2009, Peter Zijlstra wrote: > Added the generic clock and timer folks to CC. > > On Tue, 2009-05-26 at 16:31 +0200, Linus Walleij wrote: > > 2009/5/26 Paul Mundt <lethal@linux-sh.org>: > > > > > */ > > > unsigned long long __attribute__((weak)) sched_clock(void) > > > { > > > + /* > > > + * Use the current clocksource when it becomes available later in > > > + * the boot process, and ensure that it has a high enough rating > > > + * to make it suitable for general use. > > > + */ > > > + if (clock && clock->rating >= 100) > > > + return cyc2ns(clock, clocksource_read(clock)); > > > + > > > + /* Otherwise just fall back on jiffies */ > > > return (unsigned long long)(jiffies - INITIAL_JIFFIES) > > > * (NSEC_PER_SEC / HZ); > > > } > > > > This seems like it would make the patch I sent the other day > > unnecessary (subject u300 sched_clock() implementation). > > > > It would also trim off this solution found in all OMAP platforms in > > arch/arm/plat-omap/common.c > > > > BUT Peter Zijlstra replied to my question about why this wasn't > > generic with: > > > > [peterz]: > > > But that is the reason this isn't generic, non of the 'stable' > > > clocksources on x86 are fast enough to use as sched_clock. > > > > Does that mean clock->rating for these clocksources is > > for certain < 100? > > > > The definition of "rating" from the kerneldoc does not > > seem to imply that, it's a subjective measure AFAICT. Right, there is no rating threshold defined, which allows to deduce that. The TSC on x86 which might be unreliable, but usable as sched_clock has an initial rating of 300 which can be changed later on to 0 when the TSC is unusable as a time of day source. In that case clock is replaced by HPET which has a rating > 100 but is definitely not a good choice for sched_clock > > Else you might want an additional criteria, like > > cyc2ns(1) (much less than) jiffies_to_usecs(1)*1000 > > (however you do that the best way) > > so you don't pick something > > that isn't substantially faster than the jiffy counter atleast? What we can do is add another flag to the clocksource e.g. CLOCK_SOURCE_USE_FOR_SCHED_CLOCK and check this instead of the rating. Thanks, tglx -- To unsubscribe from this list: send the line "unsubscribe linux-sh" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2009-05-26 at 16:38 +0200, Peter Zijlstra wrote: > Added the generic clock and timer folks to CC. > > On Tue, 2009-05-26 at 16:31 +0200, Linus Walleij wrote: > > 2009/5/26 Paul Mundt <lethal@linux-sh.org>: > > > > > */ > > > unsigned long long __attribute__((weak)) sched_clock(void) > > > { > > > + /* > > > + * Use the current clocksource when it becomes available later in > > > + * the boot process, and ensure that it has a high enough rating > > > + * to make it suitable for general use. > > > + */ > > > + if (clock && clock->rating >= 100) > > > + return cyc2ns(clock, clocksource_read(clock)); I'm not super familiar with the recent sched_clock changes, but how will this work if the clocksource wraps (ACPI PM wraps every 2-5 seconds). Also there's no locking here, so the clocksource could change under you. Further, checking for rating being greater then 100 really doesn't mean anything. Probably need to check if the clocksource is continuous instead. Overall, I'd probably suggest thinking this through a bit more. At some point doing this right will cause sched_clock() to be basically the same as ktime_get(). So why not just use that instead of remaking it? thanks -john -- To unsubscribe from this list: send the line "unsubscribe linux-sh" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2009-05-26 at 13:23 -0700, john stultz wrote: > Overall, I'd probably suggest thinking this through a bit more. At some > point doing this right will cause sched_clock() to be basically the same > as ktime_get(). So why not just use that instead of remaking it? simply because we don't require the strict global monotonicy for scheduling as we do from a regular time source (its nice to have though). That means that on x86 we can always use TSC for sched_clock(), even when its quite unsuitable for ktime. -- To unsubscribe from this list: send the line "unsubscribe linux-sh" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 26 May 2009, john stultz wrote: > On Tue, 2009-05-26 at 16:38 +0200, Peter Zijlstra wrote: > > Added the generic clock and timer folks to CC. > > > > On Tue, 2009-05-26 at 16:31 +0200, Linus Walleij wrote: > > > 2009/5/26 Paul Mundt <lethal@linux-sh.org>: > > > > > > > */ > > > > unsigned long long __attribute__((weak)) sched_clock(void) > > > > { > > > > + /* > > > > + * Use the current clocksource when it becomes available later in > > > > + * the boot process, and ensure that it has a high enough rating > > > > + * to make it suitable for general use. > > > > + */ > > > > + if (clock && clock->rating >= 100) > > > > + return cyc2ns(clock, clocksource_read(clock)); > > I'm not super familiar with the recent sched_clock changes, but how will > this work if the clocksource wraps (ACPI PM wraps every 2-5 seconds). You don't want to use ACPI PM for sched_clock, never ever. > Also there's no locking here, so the clocksource could change under you. > > Further, checking for rating being greater then 100 really doesn't mean > anything. Probably need to check if the clocksource is continuous > instead. I'd like to have an explicit flag for this, so we can avoid that stuff like pmtimer and other slow access clock sources are used. > Overall, I'd probably suggest thinking this through a bit more. At some > point doing this right will cause sched_clock() to be basically the same > as ktime_get(). So why not just use that instead of remaking it? ktime_get() involves xtime lock and the scheduler does not care about a slightly wrong value. sched_clock does not have the accuracy requirements of time keeping, If we have an explicit flag we can replace lots of arch/embedded sched_clock implementations with a generic one which is a Good Thing. There is a world beside the broken x86 timers :) Thanks, tglx -- To unsubscribe from this list: send the line "unsubscribe linux-sh" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2009-05-26 at 22:30 +0200, Peter Zijlstra wrote: > On Tue, 2009-05-26 at 13:23 -0700, john stultz wrote: > > Overall, I'd probably suggest thinking this through a bit more. At some > > point doing this right will cause sched_clock() to be basically the same > > as ktime_get(). So why not just use that instead of remaking it? > > simply because we don't require the strict global monotonicy for > scheduling as we do from a regular time source (its nice to have > though). > > That means that on x86 we can always use TSC for sched_clock(), even > when its quite unsuitable for ktime. Right, but I guess what I'm asking is can this be a bit better defined? If we are going to use clocksources (or cyclecounters - an area I need to clean up soon), it would be good to get an idea of what is expected of the sched_clock() interface. So TSC good, HPET bad. Why? Is latency all we care about? How bad would the TSC have to be before we wouldn't want to use it? thanks -john -- To unsubscribe from this list: send the line "unsubscribe linux-sh" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2009-05-26 at 13:40 -0700, john stultz wrote: > On Tue, 2009-05-26 at 22:30 +0200, Peter Zijlstra wrote: > > On Tue, 2009-05-26 at 13:23 -0700, john stultz wrote: > > > Overall, I'd probably suggest thinking this through a bit more. At some > > > point doing this right will cause sched_clock() to be basically the same > > > as ktime_get(). So why not just use that instead of remaking it? > > > > simply because we don't require the strict global monotonicy for > > scheduling as we do from a regular time source (its nice to have > > though). > > > > That means that on x86 we can always use TSC for sched_clock(), even > > when its quite unsuitable for ktime. > > Right, but I guess what I'm asking is can this be a bit better defined? > > If we are going to use clocksources (or cyclecounters - an area I need > to clean up soon), it would be good to get an idea of what is expected > of the sched_clock() interface. > > So TSC good, HPET bad. Why? Because TSC is a few cycles to read, and you can factorize a largish prime while doing an HPET read :-) > Is latency all we care about? How bad would > the TSC have to be before we wouldn't want to use it? Anything better than jiffies ;-) For sched_clock() we want something high-res that is monotonic per cpu and has a bounded drift between cpus in the order of jiffies. Look at kernel/sched_clock.c for what we do to make really shitty TSC conform to the above requirements. -- To unsubscribe from this list: send the line "unsubscribe linux-sh" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2009-05-26 at 22:55 +0200, Peter Zijlstra wrote: > On Tue, 2009-05-26 at 13:40 -0700, john stultz wrote: > > On Tue, 2009-05-26 at 22:30 +0200, Peter Zijlstra wrote: > > > On Tue, 2009-05-26 at 13:23 -0700, john stultz wrote: > > > > Overall, I'd probably suggest thinking this through a bit more. At some > > > > point doing this right will cause sched_clock() to be basically the same > > > > as ktime_get(). So why not just use that instead of remaking it? > > > > > > simply because we don't require the strict global monotonicy for > > > scheduling as we do from a regular time source (its nice to have > > > though). > > > > > > That means that on x86 we can always use TSC for sched_clock(), even > > > when its quite unsuitable for ktime. > > > > Right, but I guess what I'm asking is can this be a bit better defined? > > > > If we are going to use clocksources (or cyclecounters - an area I need > > to clean up soon), it would be good to get an idea of what is expected > > of the sched_clock() interface. > > > > So TSC good, HPET bad. Why? > > Because TSC is a few cycles to read, and you can factorize a largish > prime while doing an HPET read :-) > > > Is latency all we care about? How bad would > > the TSC have to be before we wouldn't want to use it? > > Anything better than jiffies ;-) Except HPET thought, right? :) > For sched_clock() we want something high-res that is monotonic per cpu > and has a bounded drift between cpus in the order of jiffies. > > Look at kernel/sched_clock.c for what we do to make really shitty TSC > conform to the above requirements. Sure, I guess what I'm trying to pull out here is that should we try to create some OK_FOR_SCHED_CLOCK flag for clocksources, and then we try to make this generic so other arches can add that flag and be done, what is the guidance we want to give to arch maintainers for setting that flag? 1) Has to be very very fast. Can we put a number on this? 50ns to read? 2) How long does it have to be monotonic for? Is it ok if it wraps every few seconds? If get_cycles() || jiffies is what we want, then lets leave it there. I just want to avoid mixing the clocksource code into the sched clock code until we really get this sort of definition sorted. thanks -john -- To unsubscribe from this list: send the line "unsubscribe linux-sh" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Isn't rdtsc inherently dependent on CPU tick rate, which is somewhat variable? Shouldn't HPET be more reliable? Or can constant_tsc be relied on for important things like scheduler? Thanks Jayant -----Original Message----- From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of john stultz Sent: Tuesday, May 26, 2009 4:01 PM To: Peter Zijlstra Cc: Linus Walleij; Paul Mundt; Ingo Molnar; Andrew Victor; Haavard Skinnemoen; Andrew Morton; linux-kernel@vger.kernel.org; linux-sh@vger.kernel.org; linux-arm-kernel@lists.arm.linux.org.uk; John Stultz; Thomas Gleixner Subject: Re: [PATCH] sched: Support current clocksource handling in fallback sched_clock(). On Tue, 2009-05-26 at 22:55 +0200, Peter Zijlstra wrote: > On Tue, 2009-05-26 at 13:40 -0700, john stultz wrote: > > On Tue, 2009-05-26 at 22:30 +0200, Peter Zijlstra wrote: > > > On Tue, 2009-05-26 at 13:23 -0700, john stultz wrote: > > > > Overall, I'd probably suggest thinking this through a bit more. At some > > > > point doing this right will cause sched_clock() to be basically the same > > > > as ktime_get(). So why not just use that instead of remaking it? > > > > > > simply because we don't require the strict global monotonicy for > > > scheduling as we do from a regular time source (its nice to have > > > though). > > > > > > That means that on x86 we can always use TSC for sched_clock(), even > > > when its quite unsuitable for ktime. > > > > Right, but I guess what I'm asking is can this be a bit better defined? > > > > If we are going to use clocksources (or cyclecounters - an area I need > > to clean up soon), it would be good to get an idea of what is expected > > of the sched_clock() interface. > > > > So TSC good, HPET bad. Why? > > Because TSC is a few cycles to read, and you can factorize a largish > prime while doing an HPET read :-) > > > Is latency all we care about? How bad would > > the TSC have to be before we wouldn't want to use it? > > Anything better than jiffies ;-) Except HPET thought, right? :) > For sched_clock() we want something high-res that is monotonic per cpu > and has a bounded drift between cpus in the order of jiffies. > > Look at kernel/sched_clock.c for what we do to make really shitty TSC > conform to the above requirements. Sure, I guess what I'm trying to pull out here is that should we try to create some OK_FOR_SCHED_CLOCK flag for clocksources, and then we try to make this generic so other arches can add that flag and be done, what is the guidance we want to give to arch maintainers for setting that flag? 1) Has to be very very fast. Can we put a number on this? 50ns to read? 2) How long does it have to be monotonic for? Is it ok if it wraps every few seconds? If get_cycles() || jiffies is what we want, then lets leave it there. I just want to avoid mixing the clocksource code into the sched clock code until we really get this sort of definition sorted. thanks -john -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-sh" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 26 May 2009, john stultz wrote: w> > > Is latency all we care about? How bad would > > > the TSC have to be before we wouldn't want to use it? > > > > Anything better than jiffies ;-) > > Except HPET thought, right? :) Eeek. HPET access serializes across cores. That's orders of magnitudes worse than the granularity loss of jiffies. > > For sched_clock() we want something high-res that is monotonic per cpu > > and has a bounded drift between cpus in the order of jiffies. > > > > Look at kernel/sched_clock.c for what we do to make really shitty TSC > > conform to the above requirements. > > Sure, I guess what I'm trying to pull out here is that should we try to > create some OK_FOR_SCHED_CLOCK flag for clocksources, and then we try to > make this generic so other arches can add that flag and be done, what is > the guidance we want to give to arch maintainers for setting that flag? > > 1) Has to be very very fast. Can we put a number on this? 50ns to read? > > 2) How long does it have to be monotonic for? Is it ok if it wraps every > few seconds? > > If get_cycles() || jiffies is what we want, then lets leave it there. I > just want to avoid mixing the clocksource code into the sched clock code > until we really get this sort of definition sorted. The criterion is simple. If arch maintainer decides that the access to the particular clock source is the best compromise between granularity and access speed vs. jiffies then he can express that by setting the flag. Where is the difference between that flag and an arch specific sched_clock() implementation which overrides the weak generic one ? That arch specific function will probably do the same thing: return cyc2ns(whathever_clocksource_the_maintainer_thinks_is_the_best); So let's get rid of those sched_clock() overrrides which do nothing else than the generic version. Thanks, tglx -- To unsubscribe from this list: send the line "unsubscribe linux-sh" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 26 May 2009, Mangalampalli, JayantX wrote: 1) please do not top post. 2) please fix your mail client to do proper line breaks at ~78 chars (see http://www.tux.org/lkml/) > Isn't rdtsc inherently dependent on CPU tick rate, which is somewhat > variable? Shouldn't HPET be more reliable? Or can constant_tsc be > relied on for important things like scheduler? 1) Please understand that there is an universe which has useful timer implementations. i.e. almost everything except arch/x86 2) rdtsc is not inherently dependent on CPU frequency. It just depends on the CPU frequency for most of the CPU implementations which have the ability to change the CPU core frequency. Newer CPUs have core frequency invariant TSC implementations, but they are not perfect yet as the TSC stops in deeper C-states. 3) HPET is reliable but the hardware access to HPET is way more expensive than the access to TSC. Also it does not scale on SMP machines as the HPET access is serialized and worse than a global lock. So there is a damned good reason why we went there to add the horror of sched_clock.c to utilize TSC for hot code pathes. Thanks, tglx -- To unsubscribe from this list: send the line "unsubscribe linux-sh" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2009-05-26 at 16:00 -0700, john stultz wrote: > Sure, I guess what I'm trying to pull out here is that should we try to > create some OK_FOR_SCHED_CLOCK flag for clocksources, and then we try to > make this generic so other arches can add that flag and be done, what is > the guidance we want to give to arch maintainers for setting that flag? > > 1) Has to be very very fast. Can we put a number on this? 50ns to read? I'd express it in cpu cycles, but not sure, the very fastest the arch has :-) Just tell people this is in their scheduling hot-path and they might understand. > 2) How long does it have to be monotonic for? Forever? (per cpu) > Is it ok if it wraps every few seconds? No, if it wraps it needs to wrap on u64. -- To unsubscribe from this list: send the line "unsubscribe linux-sh" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c index e1d16c9..59bbeeb 100644 --- a/kernel/sched_clock.c +++ b/kernel/sched_clock.c @@ -30,6 +30,7 @@ #include <linux/percpu.h> #include <linux/ktime.h> #include <linux/sched.h> +#include <linux/clocksource.h> /* * Scheduler clock - returns current time in nanosec units. @@ -38,6 +39,15 @@ */ unsigned long long __attribute__((weak)) sched_clock(void) { + /* + * Use the current clocksource when it becomes available later in + * the boot process, and ensure that it has a high enough rating + * to make it suitable for general use. + */ + if (clock && clock->rating >= 100) + return cyc2ns(clock, clocksource_read(clock)); + + /* Otherwise just fall back on jiffies */ return (unsigned long long)(jiffies - INITIAL_JIFFIES) * (NSEC_PER_SEC / HZ); }
There are presently a number of issues and limitations with how the clocksource and sched_clock() interaction works today. Configurations tend to be grouped in to one of the following: - Platform provides a low rated clocksource (< 100) and prefers to use jiffies for sched_clock() due to reliability concerns. - Platform provides its own clocksource and sched_clock() that wraps in to it. - Platform uses a generic clocksource (ie, drivers/clocksource/) combined with the generic jiffies-backed sched_clock(). - Platform supports a generic highly-rated clocksource but ends up having to use the jiffies sched_clock() anyways. - Platform supports multiple highly-rated clocksources. In the first case, simply using the rating information is sufficient to figure out the proper course of action. In the second case, very few of these do anything outside of the regular cyc2ns() work on the preferred clocksource, so it tends to be more about having access to the reference clocksource data structures more than really wanting to do any special calculations in sched_clock(). The last few cases are presently what we are faced with on sh, and which also impacts other drivers/clocksource drivers (while acpi_pm seems to have alternate recourse for sched_clock(), ARM/AVR32/SH do not). In these cases multiple clocksources can be provided, and the availability of these will often depend on runtime constraints (pinmux and so forth), in which case link time determination is simply not sufficient. While these clocksources can be highly rated and can offer excellent granularity, the jiffies clocksource is still used as a fallback given the inability to sprinkle sched_clock() wrappers in the drivers themselves. Also, while sched_clock() could be moved in to struct clocksource itself, this does not help the case where sched_clock() is called in to repeatedly well before a preferred clocksource has been determined and made available (printk times and so on), so extra logic is needed regardless. This patch does the only thing I could think of to address most of these in one shot, abusing the current clocksource pointer and forcing sched_clock() to read from it directly as soon as it becomes available (and assuming that is is rated highly enough). This does add the cost of the rating test on systems that only have the jiffies clocksource, but I think this is acceptable collateral damage given that jiffies are not very granular to begin with. Signed-off-by: Paul Mundt <lethal@linux-sh.org> --- kernel/sched_clock.c | 10 ++++++++++ 1 file changed, 10 insertions(+) -- To unsubscribe from this list: send the line "unsubscribe linux-sh" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html