Message ID | 1399504982-31181-1-git-send-email-dianders@chromium.org (mailing list archive)
---|---
State | New, archived
Fixing Rafael's id. On Thu, May 8, 2014 at 4:53 AM, Doug Anderson <dianders@chromium.org> wrote: > Downscaling loops_per_jiffy on SMP ARM systems really doesn't work. > You could really only do this if: > > * Each CPU is has independent frequency changes (changing one CPU > doesn't affect another). > * We change the generic ARM udelay() code to actually look at percpu > loops_per_jiffy. There is one more case I believe: All CPUs share a single clock line and generic udelay() uses global loops_per_jiffy, as it would never happen that some other CPU is running faster and udelay would complete early > I don't know of any ARM CPUs that are totally independent that don't > just use a timer-based delay anyway. For those that don't have a > timer-based delay, we should be conservative and overestimate > loops_per_jiffy. > > Note that on some systems you might sometimes see (in the extreme case > when we're all the way downclocked) a udelay(100) become a > udelay(1000) now. > > Signed-off-by: Doug Anderson <dianders@chromium.org> > diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c > index 7c4fada..9d944f6 100644 > --- a/arch/arm/kernel/smp.c > +++ b/arch/arm/kernel/smp.c > @@ -649,39 +649,50 @@ int setup_profiling_timer(unsigned int multiplier) > > #ifdef CONFIG_CPU_FREQ > > -static DEFINE_PER_CPU(unsigned long, l_p_j_ref); > -static DEFINE_PER_CPU(unsigned long, l_p_j_ref_freq); > static unsigned long global_l_p_j_ref; > static unsigned long global_l_p_j_ref_freq; > +static unsigned long global_l_p_j_max_freq; > + > +/** > + * cpufreq_callback - Adjust loops_per_jiffies when frequency changes > + * > + * When the CPU frequency changes we need to adjust loops_per_jiffies, which > + * we assume scales linearly with frequency. > + * > + * This function is fairly castrated and only ever adjust loops_per_jiffies > + * upward. It also doesn't adjust the PER_CPU loops_per_jiffies. Here's why: > + * 1. 
The ARM udelay only ever looks at the global loops_per_jiffy not the > + * percpu one. If your CPUs _are not_ changed in lockstep you could run > + * into problems by decreasing loops_per_jiffies since one of the other > + * processors might still be running slower. > + * 2. The ARM udelay reads the loops_per_jiffy at the beginning of its loop and > + * no other times. If your CPUs _are_ changed in lockstep you could run > + * into a race where one CPU has started its loop with old (slower) > + * loops_per_jiffy and then suddenly is running faster. > + * > + * Anyone who wants a good udelay() should be using a timer-based solution > + * anyway. If you don't have a timer solution, you just gotta be conservative. > + */ > > static int cpufreq_callback(struct notifier_block *nb, > unsigned long val, void *data) > { > struct cpufreq_freqs *freq = data; > - int cpu = freq->cpu; > > if (freq->flags & CPUFREQ_CONST_LOOPS) > return NOTIFY_OK; > > - if (!per_cpu(l_p_j_ref, cpu)) { > - per_cpu(l_p_j_ref, cpu) = > - per_cpu(cpu_data, cpu).loops_per_jiffy; > - per_cpu(l_p_j_ref_freq, cpu) = freq->old; > - if (!global_l_p_j_ref) { > - global_l_p_j_ref = loops_per_jiffy; > - global_l_p_j_ref_freq = freq->old; > - } > + if (!global_l_p_j_ref) { > + global_l_p_j_ref = loops_per_jiffy; > + global_l_p_j_ref_freq = freq->old; > + global_l_p_j_max_freq = freq->old; > } > > - if ((val == CPUFREQ_PRECHANGE && freq->old < freq->new) || > - (val == CPUFREQ_POSTCHANGE && freq->old > freq->new)) { > + if (freq->new > global_l_p_j_max_freq) { > loops_per_jiffy = cpufreq_scale(global_l_p_j_ref, > global_l_p_j_ref_freq, > freq->new); > - per_cpu(cpu_data, cpu).loops_per_jiffy = > - cpufreq_scale(per_cpu(l_p_j_ref, cpu), > - per_cpu(l_p_j_ref_freq, cpu), > - freq->new); > + global_l_p_j_max_freq = freq->new; > } > return NOTIFY_OK; > } So, the end of this will happen when a CPU is set to its highest frequency for the first time. After that this routine will simply be noise and nothing more. 
Isn't it? And if that's the case, why don't we get rid of it completely and set global-loop-per-jiffy for the max freq at boot? -- viresh
Viresh, On Thu, May 8, 2014 at 3:41 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote: > Fixing Rafael's id. > > On Thu, May 8, 2014 at 4:53 AM, Doug Anderson <dianders@chromium.org> wrote: >> Downscaling loops_per_jiffy on SMP ARM systems really doesn't work. >> You could really only do this if: >> >> * Each CPU is has independent frequency changes (changing one CPU >> doesn't affect another). >> * We change the generic ARM udelay() code to actually look at percpu >> loops_per_jiffy. > > There is one more case I believe: > All CPUs share a single clock line and generic udelay() uses global > loops_per_jiffy, as it would never happen that some other CPU is running > faster and udelay would complete early I'm not sure I understood your case. Maybe you can give me an example? I can give you an example myself of the single clock case and you can tell me how your example differs: 1. Initially CPU1 and CPU2 at 200MHz. Pretend loops_per_jiffy is 1000. 2. CPU1 starts a delay. It reads global lpj (1000) and sets up its local registers for the loop. 3. At the same time, CPU2 is transitioning the system to 2000MHz. Right after CPU1 reads lpj CPU2 stores it as 10000. 4. Now CPU1 and CPU2 are running at 2000MHz but CPU1 is only looping 1000 times. It will complete too fast. ...you could possibly try to account for this in the delay loop code (being careful to handle all of the corner cases and races). ...or we could make the delay loop super conservative and suggest that people should be using a real timer. > So, the end of this will happen when a CPU is set to its highest frequency > for the first time. After that this routine will simply be noise and nothing > more. Isn't it? Yup. > And if that's the case, why don't we get rid of it completely and set > global-loop-per-jiffy for the max freq at boot ? How exactly do you do this in a generic way? I know that our systems don't always boot up at full speed.
The HP Chromebook 11 might boot up at 900MHz and stay that way for a while until the battery gets enough juice. The Samsung Chromebook 2 will boot up at 1.8GHz although some of them can go to 1.9GHz and others to 2.0GHz. Waiting to actually see the cpufreq transition is a safe way, though it does end up with some extra function calls. -Doug
On Thu, 8 May 2014, Doug Anderson wrote: > 1. Initially CPU1 and CPU2 at 200MHz. Pretend loops_per_jiffy is 1000. > > 2. CPU1 starts a delay. It reads global lpj (1000) and sets up its > local registers up for the loop. > > 3. At the same time, CPU2 is transitioning the system to 2000MHz. > Right after CPU1 reads lpj CPU2 stores it as 10000. > > 4. Now CPU1 and CPU2 are running at 2000MHz but CPU1 is only looping > 1000 times. It will complete too fast. > > ...you could possibly try to account for this in the delay loop code > (being careful to handle all of the corner cases and races). ...or we > could make the delay loop super conservative and suggest that people > should be using a real timer. I don't see how you can possibly solve this issue without a timer based delay. Even if you scale the loop count in only one direction, it will still have this problem even though the window for the race would happen much less often. Yet having a delay which is way longer than expected might cause problems in some cases. Yet clock frequency changes with the kind of magnitude you give in your example are usually not instantaneous. Given udelay should be used for very short delays, it is likely that CPU1 will complete its count before the higher clock frequency is effective in most cases. > How exactly do you do this in a generic way? I know that our systems > don't always boot up at full speed. The HP Chromebook 11 might boot > up at 900MHz and stay that way for a while until the battery gets > enough juice. The Samsung Chromebook 2 will boot up at 1.8GHz > although some of them can go to 1.9GHz and others to 2.0GHz. Waiting > to actually see the cpufreq transition is a safe way, though it does > end up with some extra function calls. The Samsung Chromebook uses an A15 which does use a timer based udelay. What about the others? IOW is this a real problem in practice or a theoretical one? 
SMP with shared clock for DVFS simply doesn't allow pure loop counts to always be accurate. Trying to fix a broken implementation with something that is still broken to some extent, and maybe more in some cases, doesn't look like much progress to me. Nicolas
Nicolas, On Thu, May 8, 2014 at 9:04 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote: > On Thu, 8 May 2014, Doug Anderson wrote: > >> 1. Initially CPU1 and CPU2 at 200MHz. Pretend loops_per_jiffy is 1000. >> >> 2. CPU1 starts a delay. It reads global lpj (1000) and sets up its >> local registers up for the loop. >> >> 3. At the same time, CPU2 is transitioning the system to 2000MHz. >> Right after CPU1 reads lpj CPU2 stores it as 10000. >> >> 4. Now CPU1 and CPU2 are running at 2000MHz but CPU1 is only looping >> 1000 times. It will complete too fast. >> >> ...you could possibly try to account for this in the delay loop code >> (being careful to handle all of the corner cases and races). ...or we >> could make the delay loop super conservative and suggest that people >> should be using a real timer. > > I don't see how you can possibly solve this issue without a timer based > delay. Even if you scale the loop count in only one direction, it will > still have this problem even though the window for the race would happen > much less often. Yet having a delay which is way longer than expected > might cause problems in some cases. You could possibly try to do something excessively clever by checking the loops per jiffy after the loop was done (and perhaps register for cpufreq changes so you know if it changed and then changed back)? As I said, I don't think it's a good use of anyone's time. Longer delays aren't very good, but IMHO having some delays of 100 => 1000 is better than having delays of 100 => 75. The former will cause mostly performance problems and the latter will cause real correctness problems. I'm not saying that 100 => 1000 is good, it's just less bad. Specifically even in a timer-based system you can't guarantee that a udelay(100) won't end up a udelay(1000) if the kernel finds something better to do than to run your code.
I agree that there might be code that breaks when a udelay(100) becomes a udelay(1000), but probably that code needs to be fixed to be more tolerant anyway. When you've got a udelay(100) => udelay(75) then suddenly you're returning from regulator code before the regulator has fully ramped up to its final voltage. You're talking to peripherals faster than they were intended to be talked to. ...etc. > Yet clock frequency changes with the kind of magnitude you give in your > example are usually not instantaneous. Given udelay should be used for > very short delays, it is likely that CPU1 will complete its count before > the higher clock frequency is effective in most cases. In practice we found that the frequency changed fast enough in a real system that we were seeing real problems, at least with a udelay(50). If you've got a system using lpj-based udelay() that supports cpufreq, grab David's test patches and try for yourself. Perhaps I was mistaken? >> How exactly do you do this in a generic way? I know that our systems >> don't always boot up at full speed. The HP Chromebook 11 might boot >> up at 900MHz and stay that way for a while until the battery gets >> enough juice. The Samsung Chromebook 2 will boot up at 1.8GHz >> although some of them can go to 1.9GHz and others to 2.0GHz. Waiting >> to actually see the cpufreq transition is a safe way, though it does >> end up with some extra function calls. > > The Samsung Chromebook uses an A15 which does use a timer based udelay. > What about the others? IOW is this a real problem in practice or a > theoretical one? Correct, the Samsung Chromebook uses a timer-based udelay upstream (it still doesn't in our tree BTW, but we're working to rectify that). ...that's why I didn't send the patch up initially. I sent it up at the request of John Stultz in a discussion about David's test code. > SMP with shared clock for DVFS simply doesn't allow pure loop counts to > always be accurate. 
Trying to fix a broken implementation with > something that is still broken to some extent, and maybe more in > some cases, doesn't look like much progress to me. I totally agree that this doesn't really fix the problem nicely, which is why I didn't send it initially. I will make the argument that this patch makes things less broken overall on any systems that actually end up running this code, but if you want to NAK it then it won't cause me any heartache. ;) -Doug
On Thu, 8 May 2014, Doug Anderson wrote: > Nicolas, > > On Thu, May 8, 2014 at 9:04 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote: > > On Thu, 8 May 2014, Doug Anderson wrote: > > > >> 1. Initially CPU1 and CPU2 at 200MHz. Pretend loops_per_jiffy is 1000. > >> > >> 2. CPU1 starts a delay. It reads global lpj (1000) and sets up its > >> local registers up for the loop. > >> > >> 3. At the same time, CPU2 is transitioning the system to 2000MHz. > >> Right after CPU1 reads lpj CPU2 stores it as 10000. > >> > >> 4. Now CPU1 and CPU2 are running at 2000MHz but CPU1 is only looping > >> 1000 times. It will complete too fast. > >> > >> ...you could possibly try to account for this in the delay loop code > >> (being careful to handle all of the corner cases and races). ...or we > >> could make the delay loop super conservative and suggest that people > >> should be using a real timer. > > > > I don't see how you can possibly solve this issue without a timer based > > delay. Even if you scale the loop count in only one direction, it will > > still have this problem even though the window for the race would happen > > much less often. Yet having a delay which is way longer than expected > > might cause problems in some cases. > > You could possibly try to do something excessively clever by checking > the loops per jiffy after the loop was done (and perhaps register for > cpufreq changes so you know if it changed and then changed back)? As > I said, I don't think it's a good use of anyone's time. Agreed. > Longer delays aren't very good, but IMHO having some delays of 100 => > 1000 is better than having delays of 100 => 75. The former will cause > mostly performance problems and the later will cause real correctness > problems. > I'm not saying that 100 => 1000 is good, it's just less bad. There might be some cases where precise timing is needed though. I thought I came across one such case in the past but I can't remember which. 
> Specifically even in a timer-based system you can't guarantee that a > udelay(100) won't end up a udelay(1000) if the kernel finds something > better to do than to run your code. I agree that there might be code > that breaks when a udelay(100) becomes a udelay(1000), but probably > that code needs to be fixed to be more tolerant anyway. The timer based udelay implementation does poll the timer, and if exact timing is important then you'll certainly turn off IRQs during the critical sequence. [...] > > SMP with shared clock for DVFS simply doesn't allow pure loop counts to > > always be accurate. Trying to fix a broken implementation with > > something that is still broken to some extent, and maybe more in > > some cases, doesn't look like much progress to me. > > I totally agree that this doesn't really fix the problem nicely which > > is why I didn't send it initially. > > I will make the argument that this patch makes things less broken > > overall on any systems that actually end up running this code, but if > > you want NAK it then it won't cause me any heartache. ;) What I insist on is for this issue to be solved using a stable counter such as a timer when available. It _is_ available on one of the targets you mentioned so that is the solution you should add to your tree. Investigating a similar solution for your other target should be preferred to hacking the udelay loop. This way you're guaranteed to solve this problem fully. Nicolas
Nicolas, On Thu, May 8, 2014 at 10:43 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote: > On Thu, 8 May 2014, Doug Anderson wrote: >> Longer delays aren't very good, but IMHO having some delays of 100 => >> 1000 is better than having delays of 100 => 75. The former will cause >> mostly performance problems and the later will cause real correctness >> problems. >> I'm not saying that 100 => 1000 is good, it's just less bad. > > There might be some cases where precise timing is needed though. > I thought I came across one such case in the past but I can't remember > which. If precise timing is needed, shouldn't it be using ktime? >> I will make the argument that this patch makes things less broken >> overall on any systems that actually end up running this code, but if >> you want NAK it then it won't cause me any heartache. ;) > > What I insist on is for this issue to be solved using a stable counter > such a timer when available. It _is_ available on one of the target you > mentioned so that is the solution you should add to your tree. Yup, we're working on it. > Investigating a similar solution for your other target should be > preferred to hacking the udelay loop. This way you're guaranteed to > solve this problem fully. I have no other target in mind. I'm merely sending this up there just in case there is some cpufreq-running ARM board that is SMP and has no timer-based udelay. Those are the only boards that could possibly be running this code anyway. I guess I would say that my patch is unhacking this code.
On Thu, May 08, 2014 at 01:43:29PM -0400, Nicolas Pitre wrote: > There might be some cases where precise timing is needed though. > I thought I came across one such case in the past but I can't remember > which. Anything which is expecting precise timings from udelay() is broken. Firstly, udelay() does _not_ guarantee to give you a delay of at least the requested period - it tries to give an _approximate_ delay. The first thing to realise is that loops_per_jiffy is calibrated with interrupts _on_, which means that the calculated loops_per_jiffy is the number of iterations in a jiffy _minus_ the time it takes for the timer interrupt to be processed. This means loops_per_jiffy will always be smaller than the number of loops that would be executed within the same period. This leads to udelay() always producing slightly shorter than requested delays - this is quite measurable. It gets worse when you consider the utter mess that the L2 cache code is in - where on many platforms we calibrate udelay() with the cache off, which results in loops_per_jiffy being smaller than it would otherwise be (meaning shorter delays.) So, that's two reasons there why loops_per_jiffy will be smaller than it should be at boot, and therefore udelay() will be inaccurate. Another reason udelay() can be inaccurate is if interrupts are on, and you have USB present. USB interrupt processing can take on the order of 10s of milliseconds even on 800MHz or faster ARM CPUs. If you receive one of those mid-udelay(), your CPU will be occupied elsewhere. Another reason is preempt. If the kernel can preempt during udelay(), your delay will also be much longer than you requested. No, disabling preemption in udelay() is not on, that would increase preemption latency. So, the only /real/ solution if you want proper delays is for udelay() to use a timer or counter, and this should always be the preferred method where it's available.
Quite rightly, we're not hacking udelay() stuff to work around not having that, or if someone configures it out.
On Thu, 8 May 2014, Doug Anderson wrote: > Nicolas, > > On Thu, May 8, 2014 at 10:43 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote: > > On Thu, 8 May 2014, Doug Anderson wrote: > > >> Longer delays aren't very good, but IMHO having some delays of 100 => > >> 1000 is better than having delays of 100 => 75. The former will cause > >> mostly performance problems and the later will cause real correctness > >> problems. > >> I'm not saying that 100 => 1000 is good, it's just less bad. > > > > There might be some cases where precise timing is needed though. > > I thought I came across one such case in the past but I can't remember > > which. > > If precise timing is needed, shouldn't it be using ktime? I meant "precise" in the sense that you might have to poke at some hardware within some deadline, e.g. set bit, wait 13us, clear bit, and not exceed 15us between both bit events. Arguably this is best handled with actual FIQs (when they're available). But I don't have any actual case of this to bring as an example. > >> I will make the argument that this patch makes things less broken > >> overall on any systems that actually end up running this code, but if > >> you want NAK it then it won't cause me any heartache. ;) > > > > What I insist on is for this issue to be solved using a stable counter > > such a timer when available. It _is_ available on one of the target you > > mentioned so that is the solution you should add to your tree. > > Yup, we're working on it. > > > > Investigating a similar solution for your other target should be > > preferred to hacking the udelay loop. This way you're guaranteed to > > solve this problem fully. > > I have no other target in mind. I'm merely sending this up there just > in case there is some cpufreq running ARM board that is SMP and has no > timer-based udelay. Those are the only boards that could possibly be > running this code anyway. > > I guess I would say that my patch is unhacking the this code.
The > code after my patch is simpler. I would perhaps argue that (ec971ea > ARM: add cpufreq transiton notifier to adjust loops_per_jiffy for smp) > should never have landed to begin with. I would agree to qualify your patch as brokenness mitigation and that might be better than nothing. But this should probably print a warning to that effect. Yet, wouldn't using ktime to implement udelay() be a better solution in that case? It would suffice to make sure a clock source with microsecond resolution is available. Nicolas
On Thu, 8 May 2014, Russell King - ARM Linux wrote: > Anything which is expecting precise timings from udelay() is broken. > Firstly, udelay() does _not_ guarantee to give you a delay of at least > the requested period - it tries to give an _approximate_ delay. > > The first thing to realise is that loops_per_jiffy is calibrated with > interrupts _on_, which means that the calculated loops_per_jiffy is > the number of iterations in a jiffy _minus_ the time it takes for the > timer interrupt to be processed. This means loops_per_jiffy will > always be smaller than the number of loops that would be executed > within the same period. > > This leads to udelay() always producing slightly shorter than > requested delays - this is quite measurable. OK, this is certainly bad. Hopefully it won't be as far off as it would be when the CPU is in the middle of a clock freq transition. > It gets worse when you consider the utter mess that the L2 cache code > is in - where on many platforms we calibrate udelay() with the cache > off, which results in loops_per_jiffy being smaller than it would > otherwise be (meaning shorter delays.) > > So, that's two reasons there why loops_per_jiffy will be smaller than > it should be at boot, and therefore udelay() will be inaccurate. > > Another reason udelay() can be unaccurate is if interrupts are on, and > you have USB present. USB interrupt processing can take on the order > of 10s of milliseconds even on 800MHz or faster ARM CPUs. If you > receive one of those mid-udelay(), your CPU will be occupied elsewhere. > > Another reason is preempt. If the kernel can preempt during udelay(), > your delay will also be much longer than you requested. No, disabling > preemption in udelay() is not on, that would increase preemption > latency. > > So, the only /real/ solution if you want proper delays is for udelay() > to use a timer or counter, and this is should always the preferred > method where it's available.
Quite rightly, we're not hacking udelay() > stuff to work around not having that, or if someone configures it out. What about using a default based on ktime_get(), or even sched_clock(), when SMP and cpufreq are configured in? Nicolas
On 05/08/2014 01:12 PM, Nicolas Pitre wrote: > On Thu, 8 May 2014, Russell King - ARM Linux wrote: > >> So, the only /real/ solution if you want proper delays is for udelay() >> to use a timer or counter, and this is should always the preferred >> method where it's available. Quite rightly, we're not hacking udelay() >> stuff to work around not having that, or if someone configures it out. > What about using a default based on ktime_get(), or even sched_clock(), > when SMP and cpufreq are configured in? While somewhat rare these days, there are some systems that still only have tick (jiffies) granular time (as well as sched_clock), which might be far too coarse for udelay and which, in addition, depends on IRQs being enabled for time to progress. The loop-based delay is sort of a hack that allows fine-grained (but somewhat inaccurate) time delays in just about any context. The problem is that loop-based delays are somewhat fragile, particularly as many of the assumptions for that code have been broken (cpufreq scaling, ASMP, SMIs or other calibration issues, etc). On these systems where the assumptions are broken, time-based counter delays are needed, but not always in place. Thus I suspect we need something that provides proper warnings when we are using loop delays on systems that break its assumptions. thanks -john
On Thu, May 08, 2014 at 04:12:14PM -0400, Nicolas Pitre wrote: > On Thu, 8 May 2014, Russell King - ARM Linux wrote: > > > Anything which is expecting precise timings from udelay() is broken. > > Firstly, udelay() does _not_ guarantee to give you a delay of at least > > the requested period - it tries to give an _approximate_ delay. > > > > The first thing to realise is that loops_per_jiffy is calibrated with > > interrupts _on_, which means that the calculated loops_per_jiffy is > > the number of iterations in a jiffy _minus_ the time it takes for the > > timer interrupt to be processed. This means loops_per_jiffy will > > always be smaller than the number of loops that would be executed > > within the same period. > > > > This leads to udelay() always producing slightly shorter than > > requested delays - this is quite measurable. > > OK, this is certainly bad. Hopefully it won't be that far off like it > would when the CPU is in the middle of a clock freq transition. It depends on the system, but my point is that the assumption that udelay() gives a delay of at least the requested time is something that is false, and has *always* been false. It's not "broken" either - it's just how the thing works, and the "fix" for it is to use a timer based implementation which isn't affected by interrupts. > > So, the only /real/ solution if you want proper delays is for udelay() > > to use a timer or counter, and this is should always the preferred > > method where it's available. Quite rightly, we're not hacking udelay() > > stuff to work around not having that, or if someone configures it out. > > What about using a default based on ktime_get(), or even sched_clock(), > when SMP and cpufreq are configured in? I see no reason to play those kinds of games. Keep the message simple. If you're in a preempt or SMP environment, provide a timer for udelay(). If you're in an environment with IRQs which can take a long time, use a timer for udelay().
If you're in an environment where the CPU clock can change unexpectedly, use a timer for udelay(). The very last thing we want to do is to sit around doing expensive calls into various timekeeping code paths which themselves add conversions on top, which then end up making udelay() latency even worse than the loop-based versions on slower machines. So... the message is nice and simple: where possible, use a timer for udelay().
On Thu, May 08, 2014 at 11:06:24AM -0700, Doug Anderson wrote: > I guess I would say that my patch is unhacking the this code. The > code after my patch is simpler. I would perhaps argue that (ec971ea > ARM: add cpufreq transiton notifier to adjust loops_per_jiffy for smp) > should never have landed to begin with. That depends on your point of view. As I've already pointed out through the examples of why udelay() is inaccurate, for driver authors, they should assume that udelay() just gives you an "approximate" delay and it has no accuracy. When you start from that point, rather than (as you seem to be) believing that it has some kind of accuracy, then the implementation of scaling the loops_per_jiffy is entirely reasonable - it's a best effort implementation. Again, just use a timer for your udelay() implementation.
Russell, On Thu, May 8, 2014 at 1:55 PM, Russell King - ARM Linux <linux@arm.linux.org.uk> wrote: > On Thu, May 08, 2014 at 11:06:24AM -0700, Doug Anderson wrote: >> I guess I would say that my patch is unhacking the this code. The >> code after my patch is simpler. I would perhaps argue that (ec971ea >> ARM: add cpufreq transiton notifier to adjust loops_per_jiffy for smp) >> should never have landed to begin with. > > That depends on your point of view. As I've already pointed out through > the examples of why udelay() is inaccurate, for driver authors, they > should assume that udelay() just gives you an "approximate" delay and > it has no accuracy. That disagrees with what Thomas Gleixner says at <http://lkml.iu.edu//hypermail/linux/kernel/1203.1/01034.html>. It also seems like perhaps the regulator core is broken, then... If a udelay(30) can end up as a udelay(20) then we may return from regulator code 10us earlier than we should and we'll assume that a regulator is ramped before it really is... I'm out tomorrow but I can confirm on Monday that I was really seeing udelay(30) become a udelay(20) without this patch.
On Thu, May 08, 2014 at 05:02:02PM -0700, Doug Anderson wrote: > Russel, > > On Thu, May 8, 2014 at 1:55 PM, Russell King - ARM Linux > <linux@arm.linux.org.uk> wrote: > > On Thu, May 08, 2014 at 11:06:24AM -0700, Doug Anderson wrote: > >> I guess I would say that my patch is unhacking the this code. The > >> code after my patch is simpler. I would perhaps argue that (ec971ea > >> ARM: add cpufreq transiton notifier to adjust loops_per_jiffy for smp) > >> should never have landed to begin with. > > > > That depends on your point of view. As I've already pointed out through > > the examples of why udelay() is inaccurate, for driver authors, they > > should assume that udelay() just gives you an "approximate" delay and > > it has no accuracy. > > That disagrees with what Thomas Gleixner says at > <http://lkml.iu.edu//hypermail/linux/kernel/1203.1/01034.html>. It > also seems like perhaps the regulator core is broken, then... If a > udelay(30) can end up as a udelay(20) then we may return from a > regulator code 10us earlier than we should and we'll assume that a > regulator is ramped before it really is... > > I'm out tomorrow but I can confirm on Monday that I was really seeing > udelay(30) be a udelay(20) without this patch. Thomas is wrong - when I researched this topic, I ended up finding that udelay() does delay _less_ than requested, and I mailed Linus about it... This is the way udelay() is - it's an approximate delay, it's *not* accurate. It's also fairly obvious when you stop and consider how it's calibrated. Take a moment to wonder when we used to recalibrate each CPU individually, why the boot CPU bogomips would be slightly lower than the secondary CPU bogomips. This is all down to the boot CPU having to run timer interrupts while the secondary CPUs weren't at the time they calibrated.
Russell,

On Thu, May 8, 2014 at 5:23 PM, Russell King - ARM Linux
<linux@arm.linux.org.uk> wrote:
> On Thu, May 08, 2014 at 05:02:02PM -0700, Doug Anderson wrote:
>> Russell,
>>
>> On Thu, May 8, 2014 at 1:55 PM, Russell King - ARM Linux
>> <linux@arm.linux.org.uk> wrote:
>> > On Thu, May 08, 2014 at 11:06:24AM -0700, Doug Anderson wrote:
>> >> I guess I would say that my patch is unhacking this code. The
>> >> code after my patch is simpler. I would perhaps argue that (ec971ea
>> >> ARM: add cpufreq transiton notifier to adjust loops_per_jiffy for smp)
>> >> should never have landed to begin with.
>> >
>> > That depends on your point of view. As I've already pointed out through
>> > the examples of why udelay() is inaccurate, for driver authors, they
>> > should assume that udelay() just gives you an "approximate" delay and
>> > it has no accuracy.
>>
>> That disagrees with what Thomas Gleixner says at
>> <http://lkml.iu.edu//hypermail/linux/kernel/1203.1/01034.html>. It
>> also seems like perhaps the regulator core is broken, then... If a
>> udelay(30) can end up as a udelay(20) then we may return from a
>> regulator call 10us earlier than we should and we'll assume that a
>> regulator is ramped before it really is...
>>
>> I'm out tomorrow, but I can confirm on Monday that I was really seeing
>> udelay(30) become a udelay(20) without this patch.
>
> Thomas is wrong - when I researched this topic, I ended up finding
> that udelay() does delay _less_ than requested, and I mailed Linus
> about it... This is the way udelay() is - it's an approximate delay,
> it's *not* accurate.
>
> It's also fairly obvious when you stop and consider how it's calibrated.
>
> Take a moment to wonder why, back when we used to recalibrate each CPU
> individually, the boot CPU bogomips would be slightly lower than the
> secondary CPU bogomips. This is all down to the boot CPU having to run
> timer interrupts while the secondary CPUs weren't, at the time they
> calibrated.
>
> Linus doesn't give a damn about udelay() being slightly short in this
> way. Neither do I.

Can you define what you mean by "slightly"? I'm personally not super
concerned by a udelay(50) becoming a udelay(49), but...

I've got a test that hammers on udelay() and tests it against ktime.
It runs for _multiple minutes_ and gives nearly spot-on udelay(50)
values. Then it gets unlucky and gives something like this (values
reported in nanoseconds):

[  467.034522] 0. want=50000, got=308708
[  467.034522] 1. want=50000, got=51917
[  467.034522] 2. want=50000, got=51167
[  467.034522] 3. want=50000, got=76291
[  467.034522] 4. want=50000, got=149959
[  467.034522] 5. want=50000, got=76458
[  467.034522] 6. want=50000, got=39583
[  467.034522] 7. want=50000, got=19459
[  467.034522] 8. want=50000, got=20625
[  467.034522] 9. want=50000, got=20375

It's things like "want 50000, got 19459" that concern me. It's even
more concerning that it only happens after multiple minutes, meaning
that it will lead to random, impossible-to-reproduce bugs.

NOTE: I haven't traced through to figure out exactly what scenario
causes the above to happen. In my case I'm simply looping over some
test code using
<https://chromium-review.googlesource.com/#/c/189760/8/init/calibrate.c>.
...but if we stop screwing with loops_per_jiffy it definitely doesn't
happen.

> Let me repeat again: use a timer.

Totally agree. We are using a timer going forward. I was posting this
patch to help those less fortunate who happen to be running on a
system where the timer isn't available for whatever reason.

-Doug
On 8 May 2014 20:55, Doug Anderson <dianders@chromium.org> wrote:
> On Thu, May 8, 2014 at 3:41 AM, Viresh Kumar <viresh.kumar@linaro.org> wrote:
>> And if that's the case, why don't we get rid of it completely and set
>> the global loops_per_jiffy for the max freq at boot?
>
> How exactly do you do this in a generic way?

We could have requested the highest possible frequency value from
cpufreq and adjusted accordingly at boot. But the thread seems to be
turning in another direction now, so let's wait and see what others
come up with.
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 7c4fada..9d944f6 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -649,39 +649,50 @@ int setup_profiling_timer(unsigned int multiplier)

 #ifdef CONFIG_CPU_FREQ

-static DEFINE_PER_CPU(unsigned long, l_p_j_ref);
-static DEFINE_PER_CPU(unsigned long, l_p_j_ref_freq);
 static unsigned long global_l_p_j_ref;
 static unsigned long global_l_p_j_ref_freq;
+static unsigned long global_l_p_j_max_freq;
+
+/**
+ * cpufreq_callback - Adjust loops_per_jiffy when frequency changes
+ *
+ * When the CPU frequency changes we need to adjust loops_per_jiffy, which
+ * we assume scales linearly with frequency.
+ *
+ * This function is fairly castrated and only ever adjusts loops_per_jiffy
+ * upward.  It also doesn't adjust the per-CPU loops_per_jiffy.  Here's why:
+ * 1. The ARM udelay only ever looks at the global loops_per_jiffy, not the
+ *    percpu one.  If your CPUs _are not_ changed in lockstep, you could run
+ *    into problems by decreasing loops_per_jiffy, since one of the other
+ *    processors might still be running slower.
+ * 2. The ARM udelay reads loops_per_jiffy at the beginning of its loop and
+ *    at no other time.  If your CPUs _are_ changed in lockstep, you could
+ *    run into a race where one CPU has started its loop with the old
+ *    (slower) loops_per_jiffy and then suddenly is running faster.
+ *
+ * Anyone who wants a good udelay() should be using a timer-based solution
+ * anyway.  If you don't have a timer solution, you just gotta be
+ * conservative.
+ */
 static int cpufreq_callback(struct notifier_block *nb,
 					unsigned long val, void *data)
 {
 	struct cpufreq_freqs *freq = data;
-	int cpu = freq->cpu;

 	if (freq->flags & CPUFREQ_CONST_LOOPS)
 		return NOTIFY_OK;

-	if (!per_cpu(l_p_j_ref, cpu)) {
-		per_cpu(l_p_j_ref, cpu) =
-			per_cpu(cpu_data, cpu).loops_per_jiffy;
-		per_cpu(l_p_j_ref_freq, cpu) = freq->old;
-		if (!global_l_p_j_ref) {
-			global_l_p_j_ref = loops_per_jiffy;
-			global_l_p_j_ref_freq = freq->old;
-		}
+	if (!global_l_p_j_ref) {
+		global_l_p_j_ref = loops_per_jiffy;
+		global_l_p_j_ref_freq = freq->old;
+		global_l_p_j_max_freq = freq->old;
 	}

-	if ((val == CPUFREQ_PRECHANGE && freq->old < freq->new) ||
-	    (val == CPUFREQ_POSTCHANGE && freq->old > freq->new)) {
+	if (freq->new > global_l_p_j_max_freq) {
 		loops_per_jiffy = cpufreq_scale(global_l_p_j_ref,
 						global_l_p_j_ref_freq,
 						freq->new);
-		per_cpu(cpu_data, cpu).loops_per_jiffy =
-			cpufreq_scale(per_cpu(l_p_j_ref, cpu),
-				      per_cpu(l_p_j_ref_freq, cpu),
-				      freq->new);
+		global_l_p_j_max_freq = freq->new;
 	}

 	return NOTIFY_OK;
 }
Downscaling loops_per_jiffy on SMP ARM systems really doesn't work.
You could really only do this if:

* Each CPU has independent frequency changes (changing one CPU
  doesn't affect another).
* We change the generic ARM udelay() code to actually look at percpu
  loops_per_jiffy.

I don't know of any ARM CPUs that are totally independent that don't
just use a timer-based delay anyway. For those that don't have a
timer-based delay, we should be conservative and overestimate
loops_per_jiffy.

Note that on some systems you might sometimes see (in the extreme case
when we're all the way downclocked) a udelay(100) become a
udelay(1000) now.

Signed-off-by: Doug Anderson <dianders@chromium.org>
---
Note that I don't have a board that has cpufreq enabled upstream, so
I'm relying on the testing I did on our local kernel-3.8. Hopefully
someone out there can test using David's nifty udelay tests.

In order to see this you'd need to make sure that you _don't_ have
arch timers enabled. See:
* https://patchwork.kernel.org/patch/4124721/
* https://patchwork.kernel.org/patch/4124731/

 arch/arm/kernel/smp.c | 45 ++++++++++++++++++++++++++++-----------------
 1 file changed, 28 insertions(+), 17 deletions(-)