Message ID | 1399504982-31181-1-git-send-email-dianders@chromium.org (mailing list archive)
---|---
State | New, archived
Fixing Rafael's id. On Thu, May 8, 2014 at 4:53 AM, Doug Anderson <dianders@chromium.org> wrote: > Downscaling loops_per_jiffy on SMP ARM systems really doesn't work. > You could really only do this if: > > * Each CPU is has independent frequency changes (changing one CPU > doesn't affect another). > * We change the generic ARM udelay() code to actually look at percpu > loops_per_jiffy. There is one more case I believe: All CPUs share a single clock line and generic udelay() uses global loops_per_jiffy, as it would never happen that some other CPU is running faster and udelay would complete early > I don't know of any ARM CPUs that are totally independent that don't > just use a timer-based delay anyway. For those that don't have a > timer-based delay, we should be conservative and overestimate > loops_per_jiffy. > > Note that on some systems you might sometimes see (in the extreme case > when we're all the way downclocked) a udelay(100) become a > udelay(1000) now. > > Signed-off-by: Doug Anderson <dianders@chromium.org> > diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c > index 7c4fada..9d944f6 100644 > --- a/arch/arm/kernel/smp.c > +++ b/arch/arm/kernel/smp.c > @@ -649,39 +649,50 @@ int setup_profiling_timer(unsigned int multiplier) > > #ifdef CONFIG_CPU_FREQ > > -static DEFINE_PER_CPU(unsigned long, l_p_j_ref); > -static DEFINE_PER_CPU(unsigned long, l_p_j_ref_freq); > static unsigned long global_l_p_j_ref; > static unsigned long global_l_p_j_ref_freq; > +static unsigned long global_l_p_j_max_freq; > + > +/** > + * cpufreq_callback - Adjust loops_per_jiffies when frequency changes > + * > + * When the CPU frequency changes we need to adjust loops_per_jiffies, which > + * we assume scales linearly with frequency. > + * > + * This function is fairly castrated and only ever adjust loops_per_jiffies > + * upward. It also doesn't adjust the PER_CPU loops_per_jiffies. Here's why: > + * 1. 
The ARM udelay only ever looks at the global loops_per_jiffy not the > + * percpu one. If your CPUs _are not_ changed in lockstep you could run > + * into problems by decreasing loops_per_jiffies since one of the other > + * processors might still be running slower. > + * 2. The ARM udelay reads the loops_per_jiffy at the beginning of its loop and > + * no other times. If your CPUs _are_ changed in lockstep you could run > + * into a race where one CPU has started its loop with old (slower) > + * loops_per_jiffy and then suddenly is running faster. > + * > + * Anyone who wants a good udelay() should be using a timer-based solution > + * anyway. If you don't have a timer solution, you just gotta be conservative. > + */ > > static int cpufreq_callback(struct notifier_block *nb, > unsigned long val, void *data) > { > struct cpufreq_freqs *freq = data; > - int cpu = freq->cpu; > > if (freq->flags & CPUFREQ_CONST_LOOPS) > return NOTIFY_OK; > > - if (!per_cpu(l_p_j_ref, cpu)) { > - per_cpu(l_p_j_ref, cpu) = > - per_cpu(cpu_data, cpu).loops_per_jiffy; > - per_cpu(l_p_j_ref_freq, cpu) = freq->old; > - if (!global_l_p_j_ref) { > - global_l_p_j_ref = loops_per_jiffy; > - global_l_p_j_ref_freq = freq->old; > - } > + if (!global_l_p_j_ref) { > + global_l_p_j_ref = loops_per_jiffy; > + global_l_p_j_ref_freq = freq->old; > + global_l_p_j_max_freq = freq->old; > } > > - if ((val == CPUFREQ_PRECHANGE && freq->old < freq->new) || > - (val == CPUFREQ_POSTCHANGE && freq->old > freq->new)) { > + if (freq->new > global_l_p_j_max_freq) { > loops_per_jiffy = cpufreq_scale(global_l_p_j_ref, > global_l_p_j_ref_freq, > freq->new); > - per_cpu(cpu_data, cpu).loops_per_jiffy = > - cpufreq_scale(per_cpu(l_p_j_ref, cpu), > - per_cpu(l_p_j_ref_freq, cpu), > - freq->new); > + global_l_p_j_max_freq = freq->new; > } > return NOTIFY_OK; > } So, the end of this will happen when a CPU is set to its highest frequency for the first time. After that this routine will simply be noise and nothing more. 
Isn't it? And if that's the case, why don't we get rid of it completely and set global-loop-per-jiffy for the max freq at boot? -- viresh
Viresh, On Thu, May 8, 2014 at 3:41 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote: > Fixing Rafael's id. > > On Thu, May 8, 2014 at 4:53 AM, Doug Anderson <dianders@chromium.org> wrote: >> Downscaling loops_per_jiffy on SMP ARM systems really doesn't work. >> You could really only do this if: >> >> * Each CPU is has independent frequency changes (changing one CPU >> doesn't affect another). >> * We change the generic ARM udelay() code to actually look at percpu >> loops_per_jiffy. > > There is one more case I believe: > All CPUs share a single clock line and generic udelay() uses global > loops_per_jiffy, as it would never happen that some other CPU is running > faster and udelay would complete early I'm not sure I understood your case. Maybe you can give me an example? I can give you an example myself of the single clock case and you can tell me how your example differs: 1. Initially CPU1 and CPU2 at 200MHz. Pretend loops_per_jiffy is 1000. 2. CPU1 starts a delay. It reads global lpj (1000) and sets up its local registers for the loop. 3. At the same time, CPU2 is transitioning the system to 2000MHz. Right after CPU1 reads lpj CPU2 stores it as 10000. 4. Now CPU1 and CPU2 are running at 2000MHz but CPU1 is only looping 1000 times. It will complete too fast. ...you could possibly try to account for this in the delay loop code (being careful to handle all of the corner cases and races). ...or we could make the delay loop super conservative and suggest that people should be using a real timer. > So, the end of this will happen when a CPU is set to its highest frequency > for the first time. After that this routine will simply be noise and nothing > more. Isn't it? Yup. > And if that's the case, why don't we get rid of it completely and set > global-loop-per-jiffy for the max freq at boot ? How exactly do you do this in a generic way? I know that our systems don't always boot up at full speed.
The HP Chromebook 11 might boot up at 900MHz and stay that way for a while until the battery gets enough juice. The Samsung Chromebook 2 will boot up at 1.8GHz although some of them can go to 1.9GHz and others to 2.0GHz. Waiting to actually see the cpufreq transition is a safe way, though it does end up with some extra function calls. -Doug
On Thu, 8 May 2014, Doug Anderson wrote: > 1. Initially CPU1 and CPU2 at 200MHz. Pretend loops_per_jiffy is 1000. > > 2. CPU1 starts a delay. It reads global lpj (1000) and sets up its > local registers up for the loop. > > 3. At the same time, CPU2 is transitioning the system to 2000MHz. > Right after CPU1 reads lpj CPU2 stores it as 10000. > > 4. Now CPU1 and CPU2 are running at 2000MHz but CPU1 is only looping > 1000 times. It will complete too fast. > > ...you could possibly try to account for this in the delay loop code > (being careful to handle all of the corner cases and races). ...or we > could make the delay loop super conservative and suggest that people > should be using a real timer. I don't see how you can possibly solve this issue without a timer based delay. Even if you scale the loop count in only one direction, it will still have this problem even though the window for the race would happen much less often. Yet having a delay which is way longer than expected might cause problems in some cases. Yet clock frequency changes with the kind of magnitude you give in your example are usually not instantaneous. Given udelay should be used for very short delays, it is likely that CPU1 will complete its count before the higher clock frequency is effective in most cases. > How exactly do you do this in a generic way? I know that our systems > don't always boot up at full speed. The HP Chromebook 11 might boot > up at 900MHz and stay that way for a while until the battery gets > enough juice. The Samsung Chromebook 2 will boot up at 1.8GHz > although some of them can go to 1.9GHz and others to 2.0GHz. Waiting > to actually see the cpufreq transition is a safe way, though it does > end up with some extra function calls. The Samsung Chromebook uses an A15 which does use a timer based udelay. What about the others? IOW is this a real problem in practice or a theoretical one? 
SMP with shared clock for DVFS simply doesn't allow pure loop counts to always be accurate. Trying to fix a broken implementation with something that is still broken to some extent, and maybe more in some cases, doesn't look like much progress to me. Nicolas
Nicolas, On Thu, May 8, 2014 at 9:04 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote: > On Thu, 8 May 2014, Doug Anderson wrote: > >> 1. Initially CPU1 and CPU2 at 200MHz. Pretend loops_per_jiffy is 1000. >> >> 2. CPU1 starts a delay. It reads global lpj (1000) and sets up its >> local registers up for the loop. >> >> 3. At the same time, CPU2 is transitioning the system to 2000MHz. >> Right after CPU1 reads lpj CPU2 stores it as 10000. >> >> 4. Now CPU1 and CPU2 are running at 2000MHz but CPU1 is only looping >> 1000 times. It will complete too fast. >> >> ...you could possibly try to account for this in the delay loop code >> (being careful to handle all of the corner cases and races). ...or we >> could make the delay loop super conservative and suggest that people >> should be using a real timer. > > I don't see how you can possibly solve this issue without a timer based > delay. Even if you scale the loop count in only one direction, it will > still have this problem even though the window for the race would happen > much less often. Yet having a delay which is way longer than expected > might cause problems in some cases. You could possibly try to do something excessively clever by checking the loops per jiffy after the loop was done (and perhaps register for cpufreq changes so you know if it changed and then changed back)? As I said, I don't think it's a good use of anyone's time. Longer delays aren't very good, but IMHO having some delays of 100 => 1000 is better than having delays of 100 => 75. The former will cause mostly performance problems and the latter will cause real correctness problems. I'm not saying that 100 => 1000 is good, it's just less bad. Specifically even in a timer-based system you can't guarantee that a udelay(100) won't end up a udelay(1000) if the kernel finds something better to do than to run your code.
I agree that there might be code that breaks when a udelay(100) becomes a udelay(1000), but probably that code needs to be fixed to be more tolerant anyway. When you've got a udelay(100) => udelay(75) then suddenly you're returning from regulator code before the regulator has fully ramped up to its final voltage. You're talking to peripherals faster than they were intended to be talked to. ...etc. > Yet clock frequency changes with the kind of magnitude you give in your > example are usually not instantaneous. Given udelay should be used for > very short delays, it is likely that CPU1 will complete its count before > the higher clock frequency is effective in most cases. In practice we found that the frequency changed fast enough in a real system that we were seeing real problems, at least with a udelay(50). If you've got a system using lpj-based udelay() that supports cpufreq, grab David's test patches and try for yourself. Perhaps I was mistaken? >> How exactly do you do this in a generic way? I know that our systems >> don't always boot up at full speed. The HP Chromebook 11 might boot >> up at 900MHz and stay that way for a while until the battery gets >> enough juice. The Samsung Chromebook 2 will boot up at 1.8GHz >> although some of them can go to 1.9GHz and others to 2.0GHz. Waiting >> to actually see the cpufreq transition is a safe way, though it does >> end up with some extra function calls. > > The Samsung Chromebook uses an A15 which does use a timer based udelay. > What about the others? IOW is this a real problem in practice or a > theoretical one? Correct, the Samsung Chromebook uses a timer-based udelay upstream (it still doesn't in our tree BTW, but we're working to rectify that). ...that's why I didn't send the patch up initially. I sent it up at the request of John Stultz in a discussion about David's test code. > SMP with shared clock for DVFS simply doesn't allow pure loop counts to > always be accurate. 
Trying to fix a broken implementation with > something that is still broken to some extent, and maybe more in > some cases, doesn't look like much progress to me. I totally agree that this doesn't really fix the problem nicely, which is why I didn't send it initially. I will make the argument that this patch makes things less broken overall on any systems that actually end up running this code, but if you want to NAK it then it won't cause me any heartache. ;) -Doug
On Thu, 8 May 2014, Doug Anderson wrote: > Nicolas, > > On Thu, May 8, 2014 at 9:04 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote: > > On Thu, 8 May 2014, Doug Anderson wrote: > > > >> 1. Initially CPU1 and CPU2 at 200MHz. Pretend loops_per_jiffy is 1000. > >> > >> 2. CPU1 starts a delay. It reads global lpj (1000) and sets up its > >> local registers up for the loop. > >> > >> 3. At the same time, CPU2 is transitioning the system to 2000MHz. > >> Right after CPU1 reads lpj CPU2 stores it as 10000. > >> > >> 4. Now CPU1 and CPU2 are running at 2000MHz but CPU1 is only looping > >> 1000 times. It will complete too fast. > >> > >> ...you could possibly try to account for this in the delay loop code > >> (being careful to handle all of the corner cases and races). ...or we > >> could make the delay loop super conservative and suggest that people > >> should be using a real timer. > > > > I don't see how you can possibly solve this issue without a timer based > > delay. Even if you scale the loop count in only one direction, it will > > still have this problem even though the window for the race would happen > > much less often. Yet having a delay which is way longer than expected > > might cause problems in some cases. > > You could possibly try to do something excessively clever by checking > the loops per jiffy after the loop was done (and perhaps register for > cpufreq changes so you know if it changed and then changed back)? As > I said, I don't think it's a good use of anyone's time. Agreed. > Longer delays aren't very good, but IMHO having some delays of 100 => > 1000 is better than having delays of 100 => 75. The former will cause > mostly performance problems and the later will cause real correctness > problems. > I'm not saying that 100 => 1000 is good, it's just less bad. There might be some cases where precise timing is needed though. I thought I came across one such case in the past but I can't remember which. 
> Specifically even in a timer-based system you can't guarantee that a > udelay(100) won't end up a udelay(1000) if the kernel finds something > better to do than to run your code. I agree that there might be code > that breaks when a udelay(100) becomes a udelay(1000), but probably > that code needs to be fixed to be more tolerant anyway. The timer based udelay implementation does poll the timer, and if exact timing is important then you'll certainly turn off IRQs during the critical sequence. [...] > > SMP with shared clock for DVFS simply doesn't allow pure loop counts to > > always be accurate. Trying to fix a broken implementation with > > something that is still broken to some extent, and maybe more in > > some cases, doesn't look like much progress to me. > > I totally agree that this doesn't really fix the problem nicely which > > is why I didn't send it initially. > > I will make the argument that this patch makes things less broken > > overall on any systems that actually end up running this code, but if > > you want NAK it then it won't cause me any heartache. ;) What I insist on is for this issue to be solved using a stable counter such as a timer when available. It _is_ available on one of the targets you mentioned so that is the solution you should add to your tree. Investigating a similar solution for your other target should be preferred to hacking the udelay loop. This way you're guaranteed to solve this problem fully. Nicolas
Nicolas, On Thu, May 8, 2014 at 10:43 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote: > On Thu, 8 May 2014, Doug Anderson wrote: >> Longer delays aren't very good, but IMHO having some delays of 100 => >> 1000 is better than having delays of 100 => 75. The former will cause >> mostly performance problems and the later will cause real correctness >> problems. >> I'm not saying that 100 => 1000 is good, it's just less bad. > > There might be some cases where precise timing is needed though. > I thought I came across one such case in the past but I can't remember > which. If precise timing is needed, shouldn't it be using ktime? >> I will make the argument that this patch makes things less broken >> overall on any systems that actually end up running this code, but if >> you want NAK it then it won't cause me any heartache. ;) > > What I insist on is for this issue to be solved using a stable counter > such a timer when available. It _is_ available on one of the target you > mentioned so that is the solution you should add to your tree. Yup, we're working on it. > Investigating a similar solution for your other target should be > preferred to hacking the udelay loop. This way you're guaranteed to > solve this problem fully. I have no other target in mind. I'm merely sending this up there just in case there is some cpufreq-running ARM board that is SMP and has no timer-based udelay. Those are the only boards that could possibly be running this code anyway. I guess I would say that my patch is unhacking this code.
On Thu, May 08, 2014 at 01:43:29PM -0400, Nicolas Pitre wrote: > There might be some cases where precise timing is needed though. > I thought I came across one such case in the past but I can't remember > which. Anything which is expecting precise timings from udelay() is broken. Firstly, udelay() does _not_ guarantee to give you a delay of at least the requested period - it tries to give an _approximate_ delay. The first thing to realise is that loops_per_jiffy is calibrated with interrupts _on_, which means that the calculated loops_per_jiffy is the number of iterations in a jiffy _minus_ the time it takes for the timer interrupt to be processed. This means loops_per_jiffy will always be smaller than the number of loops that would be executed within the same period. This leads to udelay() always producing slightly shorter than requested delays - this is quite measurable. It gets worse when you consider the utter mess that the L2 cache code is in - where on many platforms we calibrate udelay() with the cache off, which results in loops_per_jiffy being smaller than it would otherwise be (meaning shorter delays.) So, that's two reasons there why loops_per_jiffy will be smaller than it should be at boot, and therefore udelay() will be inaccurate. Another reason udelay() can be inaccurate is if interrupts are on, and you have USB present. USB interrupt processing can take on the order of 10s of milliseconds even on 800MHz or faster ARM CPUs. If you receive one of those mid-udelay(), your CPU will be occupied elsewhere. Another reason is preempt. If the kernel can preempt during udelay(), your delay will also be much longer than you requested. No, disabling preemption in udelay() is not on, that would increase preemption latency. So, the only /real/ solution if you want proper delays is for udelay() to use a timer or counter, and this should always be the preferred method where it's available.
Quite rightly, we're not hacking udelay() stuff to work around not having that, or if someone configures it out.
On Thu, 8 May 2014, Doug Anderson wrote: > Nicolas, > > On Thu, May 8, 2014 at 10:43 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote: > > On Thu, 8 May 2014, Doug Anderson wrote: > > >> Longer delays aren't very good, but IMHO having some delays of 100 => > >> 1000 is better than having delays of 100 => 75. The former will cause > >> mostly performance problems and the later will cause real correctness > >> problems. > >> I'm not saying that 100 => 1000 is good, it's just less bad. > > > > There might be some cases where precise timing is needed though. > > I thought I came across one such case in the past but I can't remember > > which. > > If precise timing is needed, shouldn't it be using ktime? I meant "precise" in the sense that you might have to poke at some hardware within some deadline, e.g. set bit, wait 13us, clear bit, and not exceed 15us between both bit events. Arguably this is best handled with actual FIQs (when they're available). But I don't have any actual case of this to bring as an example. > >> I will make the argument that this patch makes things less broken > >> overall on any systems that actually end up running this code, but if > >> you want NAK it then it won't cause me any heartache. ;) > > > > What I insist on is for this issue to be solved using a stable counter > > such a timer when available. It _is_ available on one of the target you > > mentioned so that is the solution you should add to your tree. > > Yup, we're working on it. > > > > Investigating a similar solution for your other target should be > > preferred to hacking the udelay loop. This way you're guaranteed to > > solve this problem fully. > > I have no other target in mind. I'm merely sending this up there just > in case there is some cpufreq running ARM board that is SMP and has no > timer-based udelay. Those are the only boards that could possibly be > running this code anyway. > > I guess I would say that my patch is unhacking the this code.
The > code after my patch is simpler. I would perhaps argue that (ec971ea > ARM: add cpufreq transiton notifier to adjust loops_per_jiffy for smp) > should never have landed to begin with. I would agree to qualify your patch as brokenness mitigation and that might be better than nothing. But this should probably print a warning to that effect. Yet, wouldn't using ktime to implement udelay() be a better solution in that case? It would suffice to make sure a clock source with microsecond resolution is available. Nicolas
On Thu, 8 May 2014, Russell King - ARM Linux wrote: > Anything which is expecting precise timings from udelay() is broken. > Firstly, udelay() does _not_ guarantee to give you a delay of at least > the requested period - it tries to give an _approximate_ delay. > > The first thing to realise is that loops_per_jiffy is calibrated with > interrupts _on_, which means that the calculated loops_per_jiffy is > the number of iterations in a jiffy _minus_ the time it takes for the > timer interrupt to be processed. This means loops_per_jiffy will > always be smaller than the number of loops that would be executed > within the same period. > > This leads to udelay() always producing slightly shorter than > requested delays - this is quite measurable. OK, this is certainly bad. Hopefully it won't be as far off as it would be when the CPU is in the middle of a clock freq transition. > It gets worse when you consider the utter mess that the L2 cache code > is in - where on many platforms we calibrate udelay() with the cache > off, which results in loops_per_jiffy being smaller than it would > otherwise be (meaning shorter delays.) > > So, that's two reasons there why loops_per_jiffy will be smaller than > it should be at boot, and therefore udelay() will be inaccurate. > > Another reason udelay() can be unaccurate is if interrupts are on, and > you have USB present. USB interrupt processing can take on the order > of 10s of milliseconds even on 800MHz or faster ARM CPUs. If you > receive one of those mid-udelay(), your CPU will be occupied elsewhere. > > Another reason is preempt. If the kernel can preempt during udelay(), > your delay will also be much longer than you requested. No, disabling > preemption in udelay() is not on, that would increase preemption > latency. > > So, the only /real/ solution if you want proper delays is for udelay() > to use a timer or counter, and this is should always the preferred > method where it's available.
Quite rightly, we're not hacking udelay() > stuff to work around not having that, or if someone configures it out. What about using a default based on ktime_get(), or even sched_clock(), when SMP and cpufreq are configured in? Nicolas
On 05/08/2014 01:12 PM, Nicolas Pitre wrote: > On Thu, 8 May 2014, Russell King - ARM Linux wrote: > >> So, the only /real/ solution if you want proper delays is for udelay() >> to use a timer or counter, and this is should always the preferred >> method where it's available. Quite rightly, we're not hacking udelay() >> stuff to work around not having that, or if someone configures it out. > What about using a default based on ktime_get(), or even sched_clock(), > when SMP and cpufreq are configured in? While somewhat rare these days, there are some systems that still only have tick (jiffies) granular time (as well as sched_clock), which might be far too coarse for udelay and which, in addition, depends on IRQs being enabled for time to progress. The loop-based delay is sort of a hack that allows fine-grained (but somewhat inaccurate) time delays in just about any context. The problem is that loop-based delays are somewhat fragile, particularly as many of the assumptions for that code have been broken (cpufreq scaling, ASMP, SMIs or other calibration issues, etc). On these systems where the assumptions are broken, time-based counter delays are needed, but not always in place. Thus I suspect we need something that provides proper warnings when we are using loop delays on systems that break its assumptions. thanks -john
On Thu, May 08, 2014 at 04:12:14PM -0400, Nicolas Pitre wrote: > On Thu, 8 May 2014, Russell King - ARM Linux wrote: > > > Anything which is expecting precise timings from udelay() is broken. > > Firstly, udelay() does _not_ guarantee to give you a delay of at least > > the requested period - it tries to give an _approximate_ delay. > > > > The first thing to realise is that loops_per_jiffy is calibrated with > > interrupts _on_, which means that the calculated loops_per_jiffy is > > the number of iterations in a jiffy _minus_ the time it takes for the > > timer interrupt to be processed. This means loops_per_jiffy will > > always be smaller than the number of loops that would be executed > > within the same period. > > > > This leads to udelay() always producing slightly shorter than > > requested delays - this is quite measurable. > > OK, this is certainly bad. Hopefully it won't be that far off like it > would when the CPU is in the middle of a clock freq transition. It depends on the system, but my point is that the assumption that udelay() gives a delay of at least the requested time is something that is false, and has *always* been false. It's not "broken" either - it's just how the thing works, and the "fix" for it is to use a timer based implementation which isn't affected by interrupts. > > So, the only /real/ solution if you want proper delays is for udelay() > > to use a timer or counter, and this is should always the preferred > > method where it's available. Quite rightly, we're not hacking udelay() > > stuff to work around not having that, or if someone configures it out. > > What about using a default based on ktime_get(), or even sched_clock(), > when SMP and cpufreq are configured in? I see no reason to play those kinds of games. Keep the message simple. If you're in a preempt or SMP environment, provide a timer for udelay(). If you're in an environment with IRQs which can take a long time, use a timer for udelay().
If you're in an environment where the CPU clock can change unexpectedly, use a timer for udelay(). The very last thing we want to do is to sit around doing expensive calls into various timekeeping code paths which themselves add conversions on top, which then end up making udelay() latency even worse than the loop-based versions on slower machines. So... the message is nice and simple: where possible, use a timer for udelay().
On Thu, May 08, 2014 at 11:06:24AM -0700, Doug Anderson wrote: > I guess I would say that my patch is unhacking the this code. The > code after my patch is simpler. I would perhaps argue that (ec971ea > ARM: add cpufreq transiton notifier to adjust loops_per_jiffy for smp) > should never have landed to begin with. That depends on your point of view. As I've already pointed out through the examples of why udelay() is inaccurate, for driver authors, they should assume that udelay() just gives you an "approximate" delay and it has no accuracy. When you start from that point, rather than (as you seem to be) believing that it has some kind of accuracy, then the implementation of scaling the loops_per_jiffy is entirely reasonable - it's a best effort implementation. Again, just use a timer for your udelay() implementation.
Russell, On Thu, May 8, 2014 at 1:55 PM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > On Thu, May 08, 2014 at 11:06:24AM -0700, Doug Anderson wrote: >> I guess I would say that my patch is unhacking the this code. The >> code after my patch is simpler. I would perhaps argue that (ec971ea >> ARM: add cpufreq transiton notifier to adjust loops_per_jiffy for smp) >> should never have landed to begin with. > > That depends on your point of view. As I've already pointed out through > the examples of why udelay() is inaccurate, for driver authors, they > should assume that udelay() just gives you an "approximate" delay and > it has no accuracy. That disagrees with what Thomas Gleixner says at <http://lkml.iu.edu//hypermail/linux/kernel/1203.1/01034.html>. It also seems like perhaps the regulator core is broken, then... If a udelay(30) can end up as a udelay(20) then we may return from regulator code 10us earlier than we should and we'll assume that a regulator is ramped before it really is... I'm out tomorrow but I can confirm on Monday that I was really seeing udelay(30) become a udelay(20) without this patch.
On Thu, May 08, 2014 at 05:02:02PM -0700, Doug Anderson wrote: > Russel, > > On Thu, May 8, 2014 at 1:55 PM, Russell King - ARM Linux > <linux@arm.linux.org.uk> wrote: > > On Thu, May 08, 2014 at 11:06:24AM -0700, Doug Anderson wrote: > >> I guess I would say that my patch is unhacking the this code. The > >> code after my patch is simpler. I would perhaps argue that (ec971ea > >> ARM: add cpufreq transiton notifier to adjust loops_per_jiffy for smp) > >> should never have landed to begin with. > > > > That depends on your point of view. As I've already pointed out through > > the examples of why udelay() is inaccurate, for driver authors, they > > should assume that udelay() just gives you an "approximate" delay and > > it has no accuracy. > > That disagrees with what Thomas Gleixner says at > <http://lkml.iu.edu//hypermail/linux/kernel/1203.1/01034.html>. It > also seems like perhaps the regulator core is broken, then... If a > udelay(30) can end up as a udelay(20) then we may return from a > regulator code 10us earlier than we should and we'll assume that a > regulator is ramped before it really is... > > I'm out tomorrow but I can confirm on Monday that I was really seeing > udelay(30) be a udelay(20) without this patch. Thomas is wrong - when I researched this topic, I ended up finding that udelay() does delay _less_ than requested, and I mailed Linus about it... This is the way udelay() is - it's an approximate delay, it's *not* accurate. It's also fairly obvious when you stop and consider how it's calibrated. Take a moment to wonder when we used to recalibrate each CPU individually, why the boot CPU bogomips would be slightly lower than the secondary CPU bogomips. This is all down to the boot CPU having to run timer interrupts while the secondary CPUs weren't at the time they calibrated.
Russell,

On Thu, May 8, 2014 at 5:23 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Thu, May 08, 2014 at 05:02:02PM -0700, Doug Anderson wrote:
>> Russell,
>>
>> On Thu, May 8, 2014 at 1:55 PM, Russell King - ARM Linux
>> <linux@arm.linux.org.uk> wrote:
>> > On Thu, May 08, 2014 at 11:06:24AM -0700, Doug Anderson wrote:
>> >> I guess I would say that my patch is unhacking this code. The
>> >> code after my patch is simpler. I would perhaps argue that (ec971ea
>> >> ARM: add cpufreq transiton notifier to adjust loops_per_jiffy for smp)
>> >> should never have landed to begin with.
>> >
>> > That depends on your point of view. As I've already pointed out through
>> > the examples of why udelay() is inaccurate, for driver authors, they
>> > should assume that udelay() just gives you an "approximate" delay and
>> > it has no accuracy.
>>
>> That disagrees with what Thomas Gleixner says at
>> <http://lkml.iu.edu//hypermail/linux/kernel/1203.1/01034.html>. It
>> also seems like perhaps the regulator core is broken, then... If a
>> udelay(30) can end up as a udelay(20) then we may return from a
>> regulator call 10us earlier than we should and we'll assume that a
>> regulator is ramped before it really is...
>>
>> I'm out tomorrow, but I can confirm on Monday that I was really seeing
>> udelay(30) become a udelay(20) without this patch.
>
> Thomas is wrong - when I researched this topic, I ended up finding
> that udelay() does delay _less_ than requested, and I mailed Linus
> about it... This is the way udelay() is - it's an approximate delay,
> it's *not* accurate.
>
> It's also fairly obvious when you stop and consider how it's calibrated.
>
> Take a moment to wonder why, back when we used to recalibrate each CPU
> individually, the boot CPU bogomips would be slightly lower than the
> secondary CPU bogomips. This is all down to the boot CPU having to run
> timer interrupts while the secondary CPUs weren't, at the time they
> calibrated.
>
> Linus doesn't give a damn about udelay() being slightly short in this
> way. Neither do I.

Can you define what you mean by "slightly"? I'm personally not super
concerned by a udelay(50) becoming a udelay(49), but...

I've got a test that hammers on udelay() and tests it against ktime.
It runs for _multiple minutes_ and gives nearly spot-on udelay(50)
values. Then it gets unlucky and gives something like this (values
reported in nanoseconds):

[  467.034522] 0. want=50000, got=308708
[  467.034522] 1. want=50000, got=51917
[  467.034522] 2. want=50000, got=51167
[  467.034522] 3. want=50000, got=76291
[  467.034522] 4. want=50000, got=149959
[  467.034522] 5. want=50000, got=76458
[  467.034522] 6. want=50000, got=39583
[  467.034522] 7. want=50000, got=19459
[  467.034522] 8. want=50000, got=20625
[  467.034522] 9. want=50000, got=20375

It's things like "want 50000, got 19459" that concern me. It's even
more concerning that it only happens after multiple minutes, meaning
that it will lead to random, impossible-to-reproduce bugs.

NOTE: I haven't traced through to figure out exactly what scenario
causes the above to happen. In my case I'm simply looping over some
test code using
<https://chromium-review.googlesource.com/#/c/189760/8/init/calibrate.c>.
...but if we stop screwing with loops_per_jiffy it definitely doesn't
happen.

> Let me repeat again: use a timer.

Totally agree. We are using a timer going forward. I was posting this
patch to help those less fortunate who happen to be running on a
system where the timer isn't available for whatever reason.

-Doug
On 8 May 2014 20:55, Doug Anderson <dianders@chromium.org> wrote:
> On Thu, May 8, 2014 at 3:41 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>> And if that's the case, why don't we get rid of it completely and set
>> the global loops_per_jiffy for the max freq at boot?
>
> How exactly do you do this in a generic way?

We could have requested the highest possible frequency value from
cpufreq and adjusted accordingly at boot. But the thread seems to be
turning in another direction now, so let's wait and see what others
come up with.
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 7c4fada..9d944f6 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -649,39 +649,50 @@ int setup_profiling_timer(unsigned int multiplier)

 #ifdef CONFIG_CPU_FREQ

-static DEFINE_PER_CPU(unsigned long, l_p_j_ref);
-static DEFINE_PER_CPU(unsigned long, l_p_j_ref_freq);
 static unsigned long global_l_p_j_ref;
 static unsigned long global_l_p_j_ref_freq;
+static unsigned long global_l_p_j_max_freq;
+
+/**
+ * cpufreq_callback - Adjust loops_per_jiffy when frequency changes
+ *
+ * When the CPU frequency changes we need to adjust loops_per_jiffy, which
+ * we assume scales linearly with frequency.
+ *
+ * This function is fairly castrated and only ever adjusts loops_per_jiffy
+ * upward.  It also doesn't adjust the per-CPU loops_per_jiffy.  Here's why:
+ * 1. The ARM udelay only ever looks at the global loops_per_jiffy, not the
+ *    percpu one.  If your CPUs _are not_ changed in lockstep, you could run
+ *    into problems by decreasing loops_per_jiffy, since one of the other
+ *    processors might still be running slower.
+ * 2. The ARM udelay reads loops_per_jiffy at the beginning of its loop and
+ *    at no other time.  If your CPUs _are_ changed in lockstep, you could
+ *    run into a race where one CPU has started its loop with the old
+ *    (slower) loops_per_jiffy and then suddenly is running faster.
+ *
+ * Anyone who wants a good udelay() should be using a timer-based solution
+ * anyway.  If you don't have a timer solution, you just gotta be
+ * conservative.
+ */
 static int cpufreq_callback(struct notifier_block *nb,
 					unsigned long val, void *data)
 {
 	struct cpufreq_freqs *freq = data;
-	int cpu = freq->cpu;

 	if (freq->flags & CPUFREQ_CONST_LOOPS)
 		return NOTIFY_OK;

-	if (!per_cpu(l_p_j_ref, cpu)) {
-		per_cpu(l_p_j_ref, cpu) =
-			per_cpu(cpu_data, cpu).loops_per_jiffy;
-		per_cpu(l_p_j_ref_freq, cpu) = freq->old;
-		if (!global_l_p_j_ref) {
-			global_l_p_j_ref = loops_per_jiffy;
-			global_l_p_j_ref_freq = freq->old;
-		}
+	if (!global_l_p_j_ref) {
+		global_l_p_j_ref = loops_per_jiffy;
+		global_l_p_j_ref_freq = freq->old;
+		global_l_p_j_max_freq = freq->old;
 	}

-	if ((val == CPUFREQ_PRECHANGE && freq->old < freq->new) ||
-	    (val == CPUFREQ_POSTCHANGE && freq->old > freq->new)) {
+	if (freq->new > global_l_p_j_max_freq) {
 		loops_per_jiffy = cpufreq_scale(global_l_p_j_ref,
 						global_l_p_j_ref_freq,
 						freq->new);
-		per_cpu(cpu_data, cpu).loops_per_jiffy =
-			cpufreq_scale(per_cpu(l_p_j_ref, cpu),
-				      per_cpu(l_p_j_ref_freq, cpu),
-				      freq->new);
+		global_l_p_j_max_freq = freq->new;
 	}

 	return NOTIFY_OK;
 }
Downscaling loops_per_jiffy on SMP ARM systems really doesn't work.
You could really only do this if:

* Each CPU has independent frequency changes (changing one CPU
  doesn't affect another).
* We change the generic ARM udelay() code to actually look at percpu
  loops_per_jiffy.

I don't know of any ARM CPUs that are totally independent that don't
just use a timer-based delay anyway. For those that don't have a
timer-based delay, we should be conservative and overestimate
loops_per_jiffy.

Note that on some systems you might sometimes see (in the extreme case
when we're all the way downclocked) a udelay(100) become a
udelay(1000) now.

Signed-off-by: Doug Anderson <dianders@chromium.org>
---
Note that I don't have a board that has cpufreq enabled upstream, so
I'm relying on the testing I did on our local kernel-3.8. Hopefully
someone out there can test using David's nifty udelay tests.

In order to see this you'd need to make sure that you _don't_ have
arch timers enabled. See:
* https://patchwork.kernel.org/patch/4124721/
* https://patchwork.kernel.org/patch/4124731/

 arch/arm/kernel/smp.c | 45 ++++++++++++++++++++++++++++-----------------
 1 file changed, 28 insertions(+), 17 deletions(-)