
[RFC] Improving udelay/ndelay on platforms where that is possible

Message ID 11393e07-b042-180c-3bcd-484bf51eada6@sigmadesigns.com (mailing list archive)
State New, archived

Commit Message

Marc Gonzalez Nov. 15, 2017, 12:51 p.m. UTC
On 01/11/2017 20:38, Marc Gonzalez wrote:

> OK, I'll just send my patch, and then crawl back under my rock.

Linus,

As promised, the patch is provided below. And as promised, I will
no longer bring this up on LKML.

FWIW, I have checked that the computed value matches the expected
value for all HZ and delay_us, and for a few clock frequencies,
using the following program:

$ cat delays.c
#include <stdio.h>
#define MEGA 1000000u
typedef unsigned int uint;
typedef unsigned long long u64;
#define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))

static const uint HZ_tab[] = { 100, 250, 300, 1000 };

static void check_cycle_count(uint freq, uint HZ, uint delay_us)
{
	uint UDELAY_MULT = (2147 * HZ) + (483648 * HZ / MEGA);
	uint lpj = DIV_ROUND_UP(freq, HZ);
	uint computed = ((u64)lpj * delay_us * UDELAY_MULT >> 31) + 1;
	uint expected = DIV_ROUND_UP((u64)delay_us * freq, MEGA);

	if (computed != expected)
		printf("freq=%u HZ=%u delay_us=%u comp=%u exp=%u\n", freq, HZ, delay_us, computed, expected);
}

int main(void)
{
	uint idx, delay_us, freq;

	for (freq = 3*MEGA; freq <= 100*MEGA; freq += 3*MEGA)
		for (idx = 0; idx < sizeof HZ_tab / sizeof *HZ_tab; ++idx)
			for (delay_us = 1; delay_us <= 2000; ++delay_us)
				check_cycle_count(freq, HZ_tab[idx], delay_us);

	return 0;
}



-- >8 --
Subject: [PATCH] ARM: Tweak clock-based udelay implementation

In 9f8197980d87a ("delay: Add explanation of udelay() inaccuracy")
Russell pointed out that loop-based delays may return early.

On the arm platform, delays may be either loop-based or clock-based.

This patch tweaks the clock-based implementation so that udelay(N)
is guaranteed to spin at least N microseconds.

Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
---
 arch/arm/lib/delay.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Comments

Russell King (Oracle) Nov. 15, 2017, 1:13 p.m. UTC | #1
On Wed, Nov 15, 2017 at 01:51:54PM +0100, Marc Gonzalez wrote:
> On 01/11/2017 20:38, Marc Gonzalez wrote:
> 
> > OK, I'll just send my patch, and then crawl back under my rock.
> 
> Linus,
> 
> As promised, the patch is provided below. And as promised, I will
> no longer bring this up on LKML.
> 
> FWIW, I have checked that the computed value matches the expected
> value for all HZ and delay_us, and for a few clock frequencies,
> using the following program:
> 
> $ cat delays.c
> #include <stdio.h>
> #define MEGA 1000000u
> typedef unsigned int uint;
> typedef unsigned long long u64;
> #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
> 
> static const uint HZ_tab[] = { 100, 250, 300, 1000 };
> 
> static void check_cycle_count(uint freq, uint HZ, uint delay_us)
> {
> 	uint UDELAY_MULT = (2147 * HZ) + (483648 * HZ / MEGA);
> 	uint lpj = DIV_ROUND_UP(freq, HZ);
> 	uint computed = ((u64)lpj * delay_us * UDELAY_MULT >> 31) + 1;
> 	uint expected = DIV_ROUND_UP((u64)delay_us * freq, MEGA);
> 
> 	if (computed != expected)
> 		printf("freq=%u HZ=%u delay_us=%u comp=%u exp=%u\n", freq, HZ, delay_us, computed, expected);
> }
> 
> int main(void)
> {
> 	uint idx, delay_us, freq;
> 
> 	for (freq = 3*MEGA; freq <= 100*MEGA; freq += 3*MEGA)
> 		for (idx = 0; idx < sizeof HZ_tab / sizeof *HZ_tab; ++idx)
> 			for (delay_us = 1; delay_us <= 2000; ++delay_us)
> 				check_cycle_count(freq, HZ_tab[idx], delay_us);
> 
> 	return 0;
> }
> 
> 
> 
> -- >8 --
> Subject: [PATCH] ARM: Tweak clock-based udelay implementation
> 
> In 9f8197980d87a ("delay: Add explanation of udelay() inaccuracy")
> Russell pointed out that loop-based delays may return early.
> 
> On the arm platform, delays may be either loop-based or clock-based.
> 
> This patch tweaks the clock-based implementation so that udelay(N)
> is guaranteed to spin at least N microseconds.

As I've already said, I don't want this, because it encourages people
to use too-small delays in driver code, and if we merge it then you
will look at your data sheet, decide it says "you need to wait 10us"
and write in your driver "udelay(10)" which will break on the loops
based delay.

udelay() needs to offer a consistent interface so that drivers know
what to expect no matter what the implementation is.  Making one
implementation conform to your ideas while leaving the other
implementations with other expectations is a recipe for bugs.

If you really want to do this, fix the loops_per_jiffy implementation
as well so that the consistency is maintained.
Doug Anderson Nov. 15, 2017, 6:45 p.m. UTC | #2
Hi,

On Wed, Nov 15, 2017 at 4:51 AM, Marc Gonzalez
<marc_gonzalez@sigmadesigns.com> wrote:
> On 01/11/2017 20:38, Marc Gonzalez wrote:
>
>> OK, I'll just send my patch, and then crawl back under my rock.
>
> Linus,
>
> As promised, the patch is provided below. And as promised, I will
> no longer bring this up on LKML.
>
> FWIW, I have checked that the computed value matches the expected
> value for all HZ and delay_us, and for a few clock frequencies,
> using the following program:
>
> $ cat delays.c
> #include <stdio.h>
> #define MEGA 1000000u
> typedef unsigned int uint;
> typedef unsigned long long u64;
> #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
>
> static const uint HZ_tab[] = { 100, 250, 300, 1000 };
>
> static void check_cycle_count(uint freq, uint HZ, uint delay_us)
> {
>         uint UDELAY_MULT = (2147 * HZ) + (483648 * HZ / MEGA);
>         uint lpj = DIV_ROUND_UP(freq, HZ);
>         uint computed = ((u64)lpj * delay_us * UDELAY_MULT >> 31) + 1;
>         uint expected = DIV_ROUND_UP((u64)delay_us * freq, MEGA);
>
>         if (computed != expected)
>                 printf("freq=%u HZ=%u delay_us=%u comp=%u exp=%u\n", freq, HZ, delay_us, computed, expected);
> }
>
> int main(void)
> {
>         uint idx, delay_us, freq;
>
>         for (freq = 3*MEGA; freq <= 100*MEGA; freq += 3*MEGA)
>                 for (idx = 0; idx < sizeof HZ_tab / sizeof *HZ_tab; ++idx)
>                         for (delay_us = 1; delay_us <= 2000; ++delay_us)
>                                 check_cycle_count(freq, HZ_tab[idx], delay_us);
>
>         return 0;
> }
>
>
>
> -- >8 --
> Subject: [PATCH] ARM: Tweak clock-based udelay implementation
>
> In 9f8197980d87a ("delay: Add explanation of udelay() inaccuracy")
> Russell pointed out that loop-based delays may return early.
>
> On the arm platform, delays may be either loop-based or clock-based.
>
> This patch tweaks the clock-based implementation so that udelay(N)
> is guaranteed to spin at least N microseconds.
>
> Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
> ---
>  arch/arm/lib/delay.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)

As I have indicated in the past, I'm not a believer in the "don't fix
bug A because bug B is still there" argument.  From the statements
"platform code could try to make their udelay/ndelay() be as good as
it can be on a particular platform" and "I'm very much open to udelay
improvements, and if somebody sends patches for particular platforms
to do particularly well on that platform" it's my understanding that
this is consistent with Linus's opinion.  Since Marc's bugfix seems
good and valid:

Reviewed-by: Douglas Anderson <dianders@chromium.org>

Marc's bugfix would immediately be useful if you happened to know your
driver was only running on a system that was using a timer-based
udelay on ARM.

Marc's bugfix could also form the basis of future patches that
extended the udelay() API to somehow express the error, as Linus
suggested by saying "we could maybe export some interface to give
estimated errors so that drivers could then try to correct for them
depending on just how much they care".


-Doug
Marc Gonzalez Nov. 16, 2017, 3:26 p.m. UTC | #3
On 15/11/2017 14:13, Russell King - ARM Linux wrote:

> udelay() needs to offer a consistent interface so that drivers know
> what to expect no matter what the implementation is.  Making one
> implementation conform to your ideas while leaving the other
> implementations with other expectations is a recipe for bugs.
> 
> If you really want to do this, fix the loops_per_jiffy implementation
> as well so that the consistency is maintained.

Hello Russell,

It seems to me that, when using DFS, there's a serious issue with loop-based
delays. (IIRC, it was you who pointed this out a few years ago.)

If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
when the frequency changes.

But arch/arm/lib/delay-loop.S starts by loading the current value of
loops_per_jiffy, computes the number of times to loop, and then loops.
If the frequency increases when the core is in __loop_delay, the
delay will be much shorter than requested.

Is this a correct assessment of the situation?

(BTW, does arch/arm/lib/delay-loop.S load the per_cpu loops_per_jiffy
or the system-wide variable?)

Should loop-based delays be disabled when CPUFREQ is enabled?

Regards.
Russell King (Oracle) Nov. 16, 2017, 3:36 p.m. UTC | #4
On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote:
> On 15/11/2017 14:13, Russell King - ARM Linux wrote:
> 
> > udelay() needs to offer a consistent interface so that drivers know
> > what to expect no matter what the implementation is.  Making one
> > implementation conform to your ideas while leaving the other
> > implementations with other expectations is a recipe for bugs.
> > 
> > If you really want to do this, fix the loops_per_jiffy implementation
> > as well so that the consistency is maintained.
> 
> Hello Russell,
> 
> It seems to me that, when using DFS, there's a serious issue with loop-based
> delays. (IIRC, it was you who pointed this out a few years ago.)
> 
> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
> when the frequency changes.
> 
> But arch/arm/lib/delay-loop.S starts by loading the current value of
> loops_per_jiffy, computes the number of times to loop, and then loops.
> If the frequency increases when the core is in __loop_delay, the
> delay will be much shorter than requested.
> 
> Is this a correct assessment of the situation?

Absolutely correct, and it's something that people are aware of, and
have already catered for while writing their drivers.

> (BTW, does arch/arm/lib/delay-loop.S load the per_cpu loops_per_jiffy
> or the system-wide variable?)
> 
> Should loop-based delays be disabled when CPUFREQ is enabled?

What about platforms (and there are those in the kernel today) which
have CPUFREQ enabled and also have no timer based delay registered?
These rely on using the delay loop mechanism today.

What this means is you can't just "turn off" loop-based delays just
because CPUFREQ is enabled, because that's going to cause regressions.
Marc Gonzalez Nov. 16, 2017, 3:47 p.m. UTC | #5
On 16/11/2017 16:36, Russell King - ARM Linux wrote:
> On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote:
>> On 15/11/2017 14:13, Russell King - ARM Linux wrote:
>>
>>> udelay() needs to offer a consistent interface so that drivers know
>>> what to expect no matter what the implementation is.  Making one
>>> implementation conform to your ideas while leaving the other
>>> implementations with other expectations is a recipe for bugs.
>>>
>>> If you really want to do this, fix the loops_per_jiffy implementation
>>> as well so that the consistency is maintained.
>>
>> Hello Russell,
>>
>> It seems to me that, when using DFS, there's a serious issue with loop-based
>> delays. (IIRC, it was you who pointed this out a few years ago.)
>>
>> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
>> when the frequency changes.
>>
>> But arch/arm/lib/delay-loop.S starts by loading the current value of
>> loops_per_jiffy, computes the number of times to loop, and then loops.
>> If the frequency increases when the core is in __loop_delay, the
>> delay will be much shorter than requested.
>>
>> Is this a correct assessment of the situation?
> 
> Absolutely correct, and it's something that people are aware of, and
> have already catered for while writing their drivers.

In their cpufreq driver?
In "real" device drivers that happen to use delays?

On my system, the CPU frequency may ramp up from 120 MHz to 1.2 GHz.
If the frequency increases at the beginning of __loop_delay, udelay(100)
would spin only 10 microseconds. This is likely to cause issues in
any driver using udelay.

How does one cater for that?

Regards.
Nicolas Pitre Nov. 16, 2017, 4:08 p.m. UTC | #6
On Thu, 16 Nov 2017, Marc Gonzalez wrote:

> On 16/11/2017 16:36, Russell King - ARM Linux wrote:
> > On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote:
> >> On 15/11/2017 14:13, Russell King - ARM Linux wrote:
> >>
> >>> udelay() needs to offer a consistent interface so that drivers know
> >>> what to expect no matter what the implementation is.  Making one
> >>> implementation conform to your ideas while leaving the other
> >>> implementations with other expectations is a recipe for bugs.
> >>>
> >>> If you really want to do this, fix the loops_per_jiffy implementation
> >>> as well so that the consistency is maintained.
> >>
> >> Hello Russell,
> >>
> >> It seems to me that, when using DFS, there's a serious issue with loop-based
> >> delays. (IIRC, it was you who pointed this out a few years ago.)
> >>
> >> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
> >> when the frequency changes.
> >>
> >> But arch/arm/lib/delay-loop.S starts by loading the current value of
> >> loops_per_jiffy, computes the number of times to loop, and then loops.
> >> If the frequency increases when the core is in __loop_delay, the
> >> delay will be much shorter than requested.
> >>
> >> Is this a correct assessment of the situation?
> > 
> > Absolutely correct, and it's something that people are aware of, and
> > have already catered for while writing their drivers.
> 
> In their cpufreq driver?
> In "real" device drivers that happen to use delays?
> 
> On my system, the CPU frequency may ramp up from 120 MHz to 1.2 GHz.
> If the frequency increases at the beginning of __loop_delay, udelay(100)
> would spin only 10 microseconds. This is likely to cause issues in
> any driver using udelay.
> 
> How does one cater for that?

You make sure your delays are based on a stable hardware timer.
Most platforms nowadays should have a suitable timer source.


Nicolas
Marc Gonzalez Nov. 16, 2017, 4:26 p.m. UTC | #7
On 16/11/2017 17:08, Nicolas Pitre wrote:

> On Thu, 16 Nov 2017, Marc Gonzalez wrote:
> 
>> On 16/11/2017 16:36, Russell King - ARM Linux wrote:
>>> On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote:
>>>> On 15/11/2017 14:13, Russell King - ARM Linux wrote:
>>>>
>>>>> udelay() needs to offer a consistent interface so that drivers know
>>>>> what to expect no matter what the implementation is.  Making one
>>>>> implementation conform to your ideas while leaving the other
>>>>> implementations with other expectations is a recipe for bugs.
>>>>>
>>>>> If you really want to do this, fix the loops_per_jiffy implementation
>>>>> as well so that the consistency is maintained.
>>>>
>>>> Hello Russell,
>>>>
>>>> It seems to me that, when using DFS, there's a serious issue with loop-based
>>>> delays. (IIRC, it was you who pointed this out a few years ago.)
>>>>
>>>> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
>>>> when the frequency changes.
>>>>
>>>> But arch/arm/lib/delay-loop.S starts by loading the current value of
>>>> loops_per_jiffy, computes the number of times to loop, and then loops.
>>>> If the frequency increases when the core is in __loop_delay, the
>>>> delay will be much shorter than requested.
>>>>
>>>> Is this a correct assessment of the situation?
>>>
>>> Absolutely correct, and it's something that people are aware of, and
>>> have already catered for while writing their drivers.
>>
>> In their cpufreq driver?
>> In "real" device drivers that happen to use delays?
>>
>> On my system, the CPU frequency may ramp up from 120 MHz to 1.2 GHz.
>> If the frequency increases at the beginning of __loop_delay, udelay(100)
>> would spin only 10 microseconds. This is likely to cause issues in
>> any driver using udelay.
>>
>> How does one cater for that?
> 
> You make sure your delays are based on a stable hardware timer.
> Most platforms nowadays should have a suitable timer source.

So you propose fixing loop-based delays by using clock-based delays,
is that correct? (That is indeed what I did on my platform.)

Russell stated that there are platforms using loop-based delays with
cpufreq enabled. I'm asking how they manage the brokenness.

Regards.
Russell King (Oracle) Nov. 16, 2017, 4:32 p.m. UTC | #8
On Thu, Nov 16, 2017 at 05:26:32PM +0100, Marc Gonzalez wrote:
> On 16/11/2017 17:08, Nicolas Pitre wrote:
> 
> > On Thu, 16 Nov 2017, Marc Gonzalez wrote:
> > 
> >> On 16/11/2017 16:36, Russell King - ARM Linux wrote:
> >>> On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote:
> >>>> On 15/11/2017 14:13, Russell King - ARM Linux wrote:
> >>>>
> >>>>> udelay() needs to offer a consistent interface so that drivers know
> >>>>> what to expect no matter what the implementation is.  Making one
> >>>>> implementation conform to your ideas while leaving the other
> >>>>> implementations with other expectations is a recipe for bugs.
> >>>>>
> >>>>> If you really want to do this, fix the loops_per_jiffy implementation
> >>>>> as well so that the consistency is maintained.
> >>>>
> >>>> Hello Russell,
> >>>>
> >>>> It seems to me that, when using DFS, there's a serious issue with loop-based
> >>>> delays. (IIRC, it was you who pointed this out a few years ago.)
> >>>>
> >>>> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
> >>>> when the frequency changes.
> >>>>
> >>>> But arch/arm/lib/delay-loop.S starts by loading the current value of
> >>>> loops_per_jiffy, computes the number of times to loop, and then loops.
> >>>> If the frequency increases when the core is in __loop_delay, the
> >>>> delay will be much shorter than requested.
> >>>>
> >>>> Is this a correct assessment of the situation?
> >>>
> >>> Absolutely correct, and it's something that people are aware of, and
> >>> have already catered for while writing their drivers.
> >>
> >> In their cpufreq driver?
> >> In "real" device drivers that happen to use delays?
> >>
> >> On my system, the CPU frequency may ramp up from 120 MHz to 1.2 GHz.
> >> If the frequency increases at the beginning of __loop_delay, udelay(100)
> >> would spin only 10 microseconds. This is likely to cause issues in
> >> any driver using udelay.
> >>
> >> How does one cater for that?
> > 
> > You make sure your delays are based on a stable hardware timer.
> > Most platforms nowadays should have a suitable timer source.
> 
> So you propose fixing loop-based delays by using clock-based delays,
> is that correct? (That is indeed what I did on my platform.)
> 
> Russell stated that there are platforms using loop-based delays with
> cpufreq enabled. I'm asking how they manage the brokenness.

Quite simply, they don't have such a wide range of frequencies that can
be selected.  They're on the order of 4x.  For example, the original
platform that cpufreq was developed on, a StrongARM-1110 board, can
practically range from 221MHz down to 59MHz.

BTW, your example above is incorrect.  If I'm not mistaken "120 MHz to
1.2 GHz." is only a 10x range, not a 100x range.  That would mean if
udelay(100) is initially executed while at 1.2GHz, and the frequency
drops to 120MHz during that delay, the delay could be anything between
just short of 100us to just short of 1ms.

For 10ms to come into it, you'd need to range from 1.2GHz down to
0.012GHz, iow, 12MHz.
Marc Gonzalez Nov. 16, 2017, 4:42 p.m. UTC | #9
On 16/11/2017 17:32, Russell King - ARM Linux wrote:
> On Thu, Nov 16, 2017 at 05:26:32PM +0100, Marc Gonzalez wrote:
>> On 16/11/2017 17:08, Nicolas Pitre wrote:
>>
>>> On Thu, 16 Nov 2017, Marc Gonzalez wrote:
>>>
>>>> On 16/11/2017 16:36, Russell King - ARM Linux wrote:
>>>>> On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote:
>>>>>> On 15/11/2017 14:13, Russell King - ARM Linux wrote:
>>>>>>
>>>>>>> udelay() needs to offer a consistent interface so that drivers know
>>>>>>> what to expect no matter what the implementation is.  Making one
>>>>>>> implementation conform to your ideas while leaving the other
>>>>>>> implementations with other expectations is a recipe for bugs.
>>>>>>>
>>>>>>> If you really want to do this, fix the loops_per_jiffy implementation
>>>>>>> as well so that the consistency is maintained.
>>>>>>
>>>>>> Hello Russell,
>>>>>>
>>>>>> It seems to me that, when using DFS, there's a serious issue with loop-based
>>>>>> delays. (IIRC, it was you who pointed this out a few years ago.)
>>>>>>
>>>>>> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
>>>>>> when the frequency changes.
>>>>>>
>>>>>> But arch/arm/lib/delay-loop.S starts by loading the current value of
>>>>>> loops_per_jiffy, computes the number of times to loop, and then loops.
>>>>>> If the frequency increases when the core is in __loop_delay, the
>>>>>> delay will be much shorter than requested.
>>>>>>
>>>>>> Is this a correct assessment of the situation?
>>>>>
>>>>> Absolutely correct, and it's something that people are aware of, and
>>>>> have already catered for while writing their drivers.
>>>>
>>>> In their cpufreq driver?
>>>> In "real" device drivers that happen to use delays?
>>>>
>>>> On my system, the CPU frequency may ramp up from 120 MHz to 1.2 GHz.
>>>> If the frequency increases at the beginning of __loop_delay, udelay(100)
>>>> would spin only 10 microseconds. This is likely to cause issues in
>>>> any driver using udelay.
>>>>
>>>> How does one cater for that?
>>>
>>> You make sure your delays are based on a stable hardware timer.
>>> Most platforms nowadays should have a suitable timer source.
>>
>> So you propose fixing loop-based delays by using clock-based delays,
>> is that correct? (That is indeed what I did on my platform.)
>>
>> Russell stated that there are platforms using loop-based delays with
>> cpufreq enabled. I'm asking how they manage the brokenness.
> 
> Quite simply, they don't have such a wide range of frequencies that can
> be selected.  They're on the order of 4x.  For example, the original
> platform that cpufreq was developed on, a StrongARM-1110 board, can
> practically range from 221MHz down to 59MHz.

Requesting 100 µs and spinning only 25 µs is still a problem,
don't you agree?

> BTW, your example above is incorrect.

A 10x increase in frequency causes a request of 100 µs to spin
only 10 µs, as written above.

The problem is not when the frequency drops -- this makes the
delay longer. The problem is when the frequency increases,
which makes the delay shorter.

Regards.
Nicolas Pitre Nov. 16, 2017, 4:47 p.m. UTC | #10
On Thu, 16 Nov 2017, Marc Gonzalez wrote:

> On 16/11/2017 17:08, Nicolas Pitre wrote:
> 
> > On Thu, 16 Nov 2017, Marc Gonzalez wrote:
> > 
> >> On 16/11/2017 16:36, Russell King - ARM Linux wrote:
> >>> On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote:
> >>>> On 15/11/2017 14:13, Russell King - ARM Linux wrote:
> >>>>
> >>>>> udelay() needs to offer a consistent interface so that drivers know
> >>>>> what to expect no matter what the implementation is.  Making one
> >>>>> implementation conform to your ideas while leaving the other
> >>>>> implementations with other expectations is a recipe for bugs.
> >>>>>
> >>>>> If you really want to do this, fix the loops_per_jiffy implementation
> >>>>> as well so that the consistency is maintained.
> >>>>
> >>>> Hello Russell,
> >>>>
> >>>> It seems to me that, when using DFS, there's a serious issue with loop-based
> >>>> delays. (IIRC, it was you who pointed this out a few years ago.)
> >>>>
> >>>> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
> >>>> when the frequency changes.
> >>>>
> >>>> But arch/arm/lib/delay-loop.S starts by loading the current value of
> >>>> loops_per_jiffy, computes the number of times to loop, and then loops.
> >>>> If the frequency increases when the core is in __loop_delay, the
> >>>> delay will be much shorter than requested.
> >>>>
> >>>> Is this a correct assessment of the situation?
> >>>
> >>> Absolutely correct, and it's something that people are aware of, and
> >>> have already catered for while writing their drivers.
> >>
> >> In their cpufreq driver?
> >> In "real" device drivers that happen to use delays?
> >>
> >> On my system, the CPU frequency may ramp up from 120 MHz to 1.2 GHz.
> >> If the frequency increases at the beginning of __loop_delay, udelay(100)
> >> would spin only 10 microseconds. This is likely to cause issues in
> >> any driver using udelay.
> >>
> >> How does one cater for that?
> > 
> > You make sure your delays are based on a stable hardware timer.
> > Most platforms nowadays should have a suitable timer source.
> 
> So you propose fixing loop-based delays by using clock-based delays,
> is that correct? (That is indeed what I did on my platform.)
> 
> Russell stated that there are platforms using loop-based delays with
> cpufreq enabled. I'm asking how they manage the brokenness.

Look at cpufreq_callback() in arch/arm/kernel/smp.c.


Nicolas
Marc Gonzalez Nov. 16, 2017, 4:51 p.m. UTC | #11
On 16/11/2017 17:47, Nicolas Pitre wrote:

> Look at cpufreq_callback() in arch/arm/kernel/smp.c.

Are you pointing at the scaling of loops_per_jiffy done in that function?

As I wrote earlier:

If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
when the frequency changes.

But arch/arm/lib/delay-loop.S starts by loading the current value of
loops_per_jiffy, computes the number of times to loop, and then loops.
If the frequency increases when the core is in __loop_delay, the
delay will be much shorter than requested.
Nicolas Pitre Nov. 16, 2017, 5 p.m. UTC | #12
On Thu, 16 Nov 2017, Marc Gonzalez wrote:

> On 16/11/2017 17:47, Nicolas Pitre wrote:
> 
> > Look at cpufreq_callback() in arch/arm/kernel/smp.c.
> 
> Are you pointing at the scaling of loops_per_jiffy done in that function?
> 
> As I wrote earlier:
> 
> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
> when the frequency changes.
> 
> But arch/arm/lib/delay-loop.S starts by loading the current value of
> loops_per_jiffy, computes the number of times to loop, and then loops.
> If the frequency increases when the core is in __loop_delay, the
> delay will be much shorter than requested.

The callback is invoked with CPUFREQ_PRECHANGE before the actual 
frequency increase. If your CPU clock is per core, then you won't be in 
the middle of the delay loop when this happens, unless you change your 
core clock from an interrupt handler.

If your CPU clock is common to all cores then you are screwed. In that 
case the only way out is a hardware timer based delay.


Nicolas
Russell King (Oracle) Nov. 16, 2017, 5:05 p.m. UTC | #13
On Thu, Nov 16, 2017 at 05:42:36PM +0100, Marc Gonzalez wrote:
> On 16/11/2017 17:32, Russell King - ARM Linux wrote:
> > On Thu, Nov 16, 2017 at 05:26:32PM +0100, Marc Gonzalez wrote:
> >> On 16/11/2017 17:08, Nicolas Pitre wrote:
> >>
> >>> On Thu, 16 Nov 2017, Marc Gonzalez wrote:
> >>>
> >>>> On 16/11/2017 16:36, Russell King - ARM Linux wrote:
> >>>>> On Thu, Nov 16, 2017 at 04:26:51PM +0100, Marc Gonzalez wrote:
> >>>>>> On 15/11/2017 14:13, Russell King - ARM Linux wrote:
> >>>>>>
> >>>>>>> udelay() needs to offer a consistent interface so that drivers know
> >>>>>>> what to expect no matter what the implementation is.  Making one
> >>>>>>> implementation conform to your ideas while leaving the other
> >>>>>>> implementations with other expectations is a recipe for bugs.
> >>>>>>>
> >>>>>>> If you really want to do this, fix the loops_per_jiffy implementation
> >>>>>>> as well so that the consistency is maintained.
> >>>>>>
> >>>>>> Hello Russell,
> >>>>>>
> >>>>>> It seems to me that, when using DFS, there's a serious issue with loop-based
> >>>>>> delays. (IIRC, it was you who pointed this out a few years ago.)
> >>>>>>
> >>>>>> If I'm reading arch/arm/kernel/smp.c correctly, loops_per_jiffy is scaled
> >>>>>> when the frequency changes.
> >>>>>>
> >>>>>> But arch/arm/lib/delay-loop.S starts by loading the current value of
> >>>>>> loops_per_jiffy, computes the number of times to loop, and then loops.
> >>>>>> If the frequency increases when the core is in __loop_delay, the
> >>>>>> delay will be much shorter than requested.
> >>>>>>
> >>>>>> Is this a correct assessment of the situation?
> >>>>>
> >>>>> Absolutely correct, and it's something that people are aware of, and
> >>>>> have already catered for while writing their drivers.
> >>>>
> >>>> In their cpufreq driver?
> >>>> In "real" device drivers that happen to use delays?
> >>>>
> >>>> On my system, the CPU frequency may ramp up from 120 MHz to 1.2 GHz.
> >>>> If the frequency increases at the beginning of __loop_delay, udelay(100)
> >>>> would spin only 10 microseconds. This is likely to cause issues in
> >>>> any driver using udelay.
> >>>>
> >>>> How does one cater for that?
> >>>
> >>> You make sure your delays are based on a stable hardware timer.
> >>> Most platforms nowadays should have a suitable timer source.
> >>
> >> So you propose fixing loop-based delays by using clock-based delays,
> >> is that correct? (That is indeed what I did on my platform.)
> >>
> >> Russell stated that there are platforms using loop-based delays with
> >> cpufreq enabled. I'm asking how they manage the brokenness.
> > 
> > Quite simply, they don't have such a wide range of frequencies that can
> > be selected.  They're on the order of 4x.  For example, the original
> > platform that cpufreq was developed on, a StrongARM-1110 board, can
> > practically range from 221MHz down to 59MHz.
> 
> Requesting 100 µs and spinning only 25 µs is still a problem,
> don't you agree?

Which is why, as I've said *many* times already, that drivers are written
with leeway on the delays.

I get the impression that we're just going around in circles, and what
you're trying to do is to get me to agree with your point of view.
That's not going to happen, because I know the history over about the
last /24/ years of kernel development (which is how long I've been
involved with the kernel.)  That's almost a quarter of a century!

I know how things were done years ago (which is relevant because we
still have support in the kernel for these systems), and I also know the
history of facilities like cpufreq - I was the one who took the work
that Erik Mouw and others involved with the LART project, and turned it
into something a little more generic.  The idea of dynamically scaling
the CPU frequency on ARM SoCs was something that the SoC manufacturer
had not even considered - it was innovative.

I know that udelay() can return short delays when used in a kernel with
cpufreq enabled, and I also know that's almost an impossible problem to
solve without going to a timer-based delay.

So, when you think that sending an email about a udelay() that can be
10x shorter might be somehow new information, and might convince people
that there's a problem, I'm afraid that it isn't really new information.
The SA1110 cpufreq driver is dated 2001, and carries my copyright, and
has the ability to make udelay()s 4x shorter or 4x longer depending on
the direction of change.

We've discussed solutions in the past (probably 10 years ago) about
this, and what can be done, and the conclusion to that was, as Nicolas
has said, to switch to using a timer-based delay mechanism where
possible.  Where this is not possible, the platform is stuck with the
loops based delays, and their inherent variability and inaccuracy.
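
The variability Russell describes can be quantified with a trivial user-space model (a sketch, not kernel code; the frequency figures are the SA-1110 ones cited earlier in the thread): a loop-based udelay() spins a fixed iteration count calibrated at boot, so the wall-clock delay scales with calibrated frequency over current frequency.

```c
#include <assert.h>

/* Model of a loop-based delay under cpufreq: the spin count is fixed
 * when loops_per_jiffy is calibrated, so the actual wall-clock delay
 * scales with calibrated_hz / current_hz. */
static double actual_delay_us(double requested_us,
                              double calibrated_hz, double current_hz)
{
    return requested_us * calibrated_hz / current_hz;
}
```

Calibrated at 59 MHz and later run at 221 MHz, a udelay(100) spins only about 27 µs; in the other direction it stretches to about 375 µs — the roughly 4x error in either direction mentioned above.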

These platforms have been tested with such a setup over many years.
They work even with udelay() having this behaviour, because it's a
known issue and drivers cater for it in ways that I've already covered
in my many previous emails to you.

These issues are known.  They've been known for the last 15 odd years.

> > BTW, your example above is incorrect.
> 
> A 10x increase in frequency causes a request of 100 µs to spin
> only 10 µs, as written above.

Right, sorry, misread.
Marc Gonzalez Nov. 16, 2017, 9:05 p.m. UTC | #14
On 16/11/2017 18:05, Russell King - ARM Linux wrote:

> On Thu, Nov 16, 2017 at 05:42:36PM +0100, Marc Gonzalez wrote:
>
>> Requesting 100 µs and spinning only 25 µs is still a problem,
>> don't you agree?
> 
> Which is why, as I've said *many* times already, drivers are written
> with leeway on the delays.

A delay 75% too short is possible. Roger that.

> I get the impression that we're just going around in circles, and what
> you're trying to do is to get me to agree with your point of view.
> That's not going to happen, because I know the history over about the
> last /24/ years of kernel development (which is how long I've been
> involved with the kernel.)  That's almost a quarter of a century!
> 
> I know how things were done years ago (which is relevant because we
> still have support in the kernel for these systems), and I also know the
> history of facilities like cpufreq - I was the one who took the work
> of Erik Mouw and others involved with the LART project, and turned it
> into something a little more generic.  The idea of dynamically scaling
> the CPU frequency on ARM SoCs was something that the SoC manufacturer
> had not even considered - it was innovative.
> 
> I know that udelay() can return short delays when used in a kernel with
> cpufreq enabled, and I also know that's almost an impossible problem to
> solve without going to a timer-based delay.
> 
> So, when you think that sending an email about a udelay() that can be
> 10x shorter might be somehow new information, and might convince people
> that there's a problem, I'm afraid that it isn't really new information.
> The SA1110 cpufreq driver is dated 2001, and carries my copyright, and
> has the ability to make udelay()s 4x shorter or 4x longer depending on
> the direction of change.
> 
> We've discussed solutions in the past (probably 10 years ago) about
> this, and what can be done, and the conclusion to that was, as Nicolas
> has said, to switch to using a timer-based delay mechanism where
> possible.  Where this is not possible, the platform is stuck with the
> loops based delays, and their inherent variability and inaccuracy.
> 
> These platforms have been tested with such a setup over many years.
> They work even with udelay() having this behaviour, because it's a
> known issue and drivers cater for it in ways that I've already covered
> in my many previous emails to you.
> 
> These issues are known.  They've been known for the last 15 odd years.

So you've known for umpteen years that fixing loop-based delays is
intractable, yet you wrote:

> udelay() needs to offer a consistent interface so that drivers know
> what to expect no matter what the implementation is.  Making one
> implementation conform to your ideas while leaving the other
> implementations with other expectations is a recipe for bugs.
> 
> If you really want to do this, fix the loops_per_jiffy implementation
> as well so that the consistency is maintained.

In other words, "I'll consider your patch as soon as Hell freezes over".

Roger that. I'll drop the subject then.
Doug Anderson Nov. 16, 2017, 10:15 p.m. UTC | #15
Hi,

On Thu, Nov 16, 2017 at 1:05 PM, Marc Gonzalez
<marc_gonzalez@sigmadesigns.com> wrote:
> On 16/11/2017 18:05, Russell King - ARM Linux wrote:
>
>> On Thu, Nov 16, 2017 at 05:42:36PM +0100, Marc Gonzalez wrote:
>>
>>> Requesting 100 µs and spinning only 25 µs is still a problem,
>>> don't you agree?
>>
>> Which is why, as I've said *many* times already, drivers are written
>> with leeway on the delays.
>
> A delay 75% too short is possible. Roger that.
>
>> I get the impression that we're just going around in circles, and what
>> you're trying to do is to get me to agree with your point of view.
>> That's not going to happen, because I know the history over about the
>> last /24/ years of kernel development (which is how long I've been
>> involved with the kernel.)  That's almost a quarter of a century!
>>
>> I know how things were done years ago (which is relevant because we
>> still have support in the kernel for these systems), and I also know the
>> history of facilities like cpufreq - I was the one who took the work
>> of Erik Mouw and others involved with the LART project, and turned it
>> into something a little more generic.  The idea of dynamically scaling
>> the CPU frequency on ARM SoCs was something that the SoC manufacturer
>> had not even considered - it was innovative.
>>
>> I know that udelay() can return short delays when used in a kernel with
>> cpufreq enabled, and I also know that's almost an impossible problem to
>> solve without going to a timer-based delay.
>>
>> So, when you think that sending an email about a udelay() that can be
>> 10x shorter might be somehow new information, and might convince people
>> that there's a problem, I'm afraid that it isn't really new information.
>> The SA1110 cpufreq driver is dated 2001, and carries my copyright, and
>> has the ability to make udelay()s 4x shorter or 4x longer depending on
>> the direction of change.
>>
>> We've discussed solutions in the past (probably 10 years ago) about
>> this, and what can be done, and the conclusion to that was, as Nicolas
>> has said, to switch to using a timer-based delay mechanism where
>> possible.  Where this is not possible, the platform is stuck with the
>> loops based delays, and their inherent variability and inaccuracy.
>>
>> These platforms have been tested with such a setup over many years.
>> They work even with udelay() having this behaviour, because it's a
>> known issue and drivers cater for it in ways that I've already covered
>> in my many previous emails to you.
>>
>> These issues are known.  They've been known for the last 15 odd years.
>
> So you've known for umpteen years that fixing loop-based delays is
> intractable, yet you wrote:
>
>> udelay() needs to offer a consistent interface so that drivers know
>> what to expect no matter what the implementation is.  Making one
>> implementation conform to your ideas while leaving the other
>> implementations with other expectations is a recipe for bugs.
>>
>> If you really want to do this, fix the loops_per_jiffy implementation
>> as well so that the consistency is maintained.
>
> In other words, "I'll consider your patch as soon as Hell freezes over".
>
> Roger that. I'll drop the subject then.

Presumably, though, you could introduce a new API like:

  udelay_atleast()

That was guaranteed to delay at least the given number of
microseconds.  Unlike the current udelay(), the new udelay_atleast()
wouldn't really try that hard to get a delay that's approximately the
one requested, it would just guarantee not to ever delay _less_ than
the amount requested.  Thus there would be some reasons to use one API
or the other and it would be up to the driver to pick one.  You
wouldn't regress any performance on old code since the old API would
work the same as it always did.
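
Such a wrapper could be sketched as follows (a user-space simulation under assumed semantics — `udelay_atleast()` is only Doug's proposal here, not an existing kernel API, and `get_cycles()` is faked with a counter): round the cycle count up rather than truncating, then spin one cycle past the target so the delay can never come up short.

```c
#include <assert.h>
#include <stdint.h>

/* Minimum number of timer cycles covering at least `usecs` microseconds:
 * ceil(usecs * freq / 1e6), never rounding down. */
static uint64_t cycles_atleast(uint64_t usecs, uint64_t timer_freq_hz)
{
    return (usecs * timer_freq_hz + 999999) / 1000000;
}

/* Fake monotonic cycle counter standing in for get_cycles(). */
static uint64_t sim_now;
static uint64_t get_cycles(void) { return sim_now++; }

/* Spin until *more* than `target` cycles have elapsed (<= rather than <),
 * so the partial cycle in progress at the start is also waited out.
 * Returns the number of spins for inspection. */
static uint64_t udelay_atleast_sim(uint64_t usecs, uint64_t timer_freq_hz)
{
    uint64_t target = cycles_atleast(usecs, timer_freq_hz);
    uint64_t start = get_cycles();
    uint64_t spins = 0;

    while ((get_cycles() - start) <= target)
        spins++;  /* cpu_relax() in the kernel */
    return spins;
}
```

With a 1 MHz timer, `udelay_atleast_sim(100, 1000000)` always spins through at least 100 cycles, regardless of where in a cycle the delay begins.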


You could presumably implement udelay_atleast() something like I did
in an ancient patch I was involved in at http://crosreview.com/189885.
In this case you wouldn't modify the normal udelay() but you'd
actually add a new API.  (for the curious, we later picked back a
proper timer-based delay in http://crosreview.com/218961 so it's not
like we were stuck with the crappy delay for long)


Once you added udelay_atleast() you'd have a good reason to fixup the
timer-based udelay() since udelay_atleast() could just be a wrapper of
that function and you'd have the requested consistency.


-Doug
Russell King (Oracle) Nov. 16, 2017, 11:22 p.m. UTC | #16
On Thu, Nov 16, 2017 at 02:15:02PM -0800, Doug Anderson wrote:
> Hi,
> 
> On Thu, Nov 16, 2017 at 1:05 PM, Marc Gonzalez
> <marc_gonzalez@sigmadesigns.com> wrote:
> > On 16/11/2017 18:05, Russell King - ARM Linux wrote:
> >
> >> On Thu, Nov 16, 2017 at 05:42:36PM +0100, Marc Gonzalez wrote:
> >>
> >>> Requesting 100 µs and spinning only 25 µs is still a problem,
> >>> don't you agree?
> >>
> >> Which is why, as I've said *many* times already, drivers are written
> >> with leeway on the delays.
> >
> > A delay 75% too short is possible. Roger that.
> >
> >> I get the impression that we're just going around in circles, and what
> >> you're trying to do is to get me to agree with your point of view.
> >> That's not going to happen, because I know the history over about the
> >> last /24/ years of kernel development (which is how long I've been
> >> involved with the kernel.)  That's almost a quarter of a century!
> >>
> >> I know how things were done years ago (which is relevant because we
> >> still have support in the kernel for these systems), and I also know the
> >> history of facilities like cpufreq - I was the one who took the work
> >> of Erik Mouw and others involved with the LART project, and turned it
> >> into something a little more generic.  The idea of dynamically scaling
> >> the CPU frequency on ARM SoCs was something that the SoC manufacturer
> >> had not even considered - it was innovative.
> >>
> >> I know that udelay() can return short delays when used in a kernel with
> >> cpufreq enabled, and I also know that's almost an impossible problem to
> >> solve without going to a timer-based delay.
> >>
> >> So, when you think that sending an email about a udelay() that can be
> >> 10x shorter might be somehow new information, and might convince people
> >> that there's a problem, I'm afraid that it isn't really new information.
> >> The SA1110 cpufreq driver is dated 2001, and carries my copyright, and
> >> has the ability to make udelay()s 4x shorter or 4x longer depending on
> >> the direction of change.
> >>
> >> We've discussed solutions in the past (probably 10 years ago) about
> >> this, and what can be done, and the conclusion to that was, as Nicolas
> >> has said, to switch to using a timer-based delay mechanism where
> >> possible.  Where this is not possible, the platform is stuck with the
> >> loops based delays, and their inherent variability and inaccuracy.
> >>
> >> These platforms have been tested with such a setup over many years.
> >> They work even with udelay() having this behaviour, because it's a
> >> known issue and drivers cater for it in ways that I've already covered
> >> in my many previous emails to you.
> >>
> >> These issues are known.  They've been known for the last 15 odd years.
> >
> > So you've known for umpteen years that fixing loop-based delays is
> > intractable, yet you wrote:
> >
> >> udelay() needs to offer a consistent interface so that drivers know
> >> what to expect no matter what the implementation is.  Making one
> >> implementation conform to your ideas while leaving the other
> >> implementations with other expectations is a recipe for bugs.
> >>
> >> If you really want to do this, fix the loops_per_jiffy implementation
> >> as well so that the consistency is maintained.
> >
> > In other words, "I'll consider your patch as soon as Hell freezes over".
> >
> > Roger that. I'll drop the subject then.
> 
> Presumably, though, you could introduce a new API like:
> 
>   udelay_atleast()
> 
> That was guaranteed to delay at least the given number of
> microseconds.  Unlike the current udelay(), the new udelay_atleast()
> wouldn't really try that hard to get a delay that's approximately the
> one requested, it would just guarantee not to ever delay _less_ than
> the amount requested.

I look forward to reviewing your implementation.
Doug Anderson Nov. 20, 2017, 5:38 p.m. UTC | #17
Hi,

On Thu, Nov 16, 2017 at 3:22 PM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
> On Thu, Nov 16, 2017 at 02:15:02PM -0800, Doug Anderson wrote:
>> Hi,
>>
>> On Thu, Nov 16, 2017 at 1:05 PM, Marc Gonzalez
>> <marc_gonzalez@sigmadesigns.com> wrote:
>> > On 16/11/2017 18:05, Russell King - ARM Linux wrote:
>> >
>> >> On Thu, Nov 16, 2017 at 05:42:36PM +0100, Marc Gonzalez wrote:
>> >>
>> >>> Requesting 100 µs and spinning only 25 µs is still a problem,
>> >>> don't you agree?
>> >>
>> >> Which is why, as I've said *many* times already, drivers are written
>> >> with leeway on the delays.
>> >
>> > A delay 75% too short is possible. Roger that.
>> >
>> >> I get the impression that we're just going around in circles, and what
>> >> you're trying to do is to get me to agree with your point of view.
>> >> That's not going to happen, because I know the history over about the
>> >> last /24/ years of kernel development (which is how long I've been
>> >> involved with the kernel.)  That's almost a quarter of a century!
>> >>
>> >> I know how things were done years ago (which is relevant because we
>> >> still have support in the kernel for these systems), and I also know the
>> >> history of facilities like cpufreq - I was the one who took the work
>> >> of Erik Mouw and others involved with the LART project, and turned it
>> >> into something a little more generic.  The idea of dynamically scaling
>> >> the CPU frequency on ARM SoCs was something that the SoC manufacturer
>> >> had not even considered - it was innovative.
>> >>
>> >> I know that udelay() can return short delays when used in a kernel with
>> >> cpufreq enabled, and I also know that's almost an impossible problem to
>> >> solve without going to a timer-based delay.
>> >>
>> >> So, when you think that sending an email about a udelay() that can be
>> >> 10x shorter might be somehow new information, and might convince people
>> >> that there's a problem, I'm afraid that it isn't really new information.
>> >> The SA1110 cpufreq driver is dated 2001, and carries my copyright, and
>> >> has the ability to make udelay()s 4x shorter or 4x longer depending on
>> >> the direction of change.
>> >>
>> >> We've discussed solutions in the past (probably 10 years ago) about
>> >> this, and what can be done, and the conclusion to that was, as Nicolas
>> >> has said, to switch to using a timer-based delay mechanism where
>> >> possible.  Where this is not possible, the platform is stuck with the
>> >> loops based delays, and their inherent variability and inaccuracy.
>> >>
>> >> These platforms have been tested with such a setup over many years.
>> >> They work even with udelay() having this behaviour, because it's a
>> >> known issue and drivers cater for it in ways that I've already covered
>> >> in my many previous emails to you.
>> >>
>> >> These issues are known.  They've been known for the last 15 odd years.
>> >
>> > So you've known for umpteen years that fixing loop-based delays is
>> > intractable, yet you wrote:
>> >
>> >> udelay() needs to offer a consistent interface so that drivers know
>> >> what to expect no matter what the implementation is.  Making one
>> >> implementation conform to your ideas while leaving the other
>> >> implementations with other expectations is a recipe for bugs.
>> >>
>> >> If you really want to do this, fix the loops_per_jiffy implementation
>> >> as well so that the consistency is maintained.
>> >
>> > In other words, "I'll consider your patch as soon as Hell freezes over".
>> >
>> > Roger that. I'll drop the subject then.
>>
>> Presumably, though, you could introduce a new API like:
>>
>>   udelay_atleast()
>>
>> That was guaranteed to delay at least the given number of
>> microseconds.  Unlike the current udelay(), the new udelay_atleast()
>> wouldn't really try that hard to get a delay that's approximately the
>> one requested, it would just guarantee not to ever delay _less_ than
>> the amount requested.
>
> I look forward to reviewing your implementation.

It's unlikely I'll post a patch in the near term since this isn't
presenting me with a big problem right now.  Mostly I saw Marc's patch
and thought it would be a good patch to land and I knew this type of
thing had bitten me in the past.

One happy result of this whole discussion, though, is that you now
sound as if you'll be happy the next time someone brings this up since
you're looking forward to reviewing an implementation.  That's a nice
change from the original statement questioning why someone was asking
about this again.  :)

As I've expressed in the past, IMHO fixing the timer based delays is
still a sane thing to do even without also fixing all the loop-based
implementations.  As someone said earlier in this thread, all the old
platforms using loop-based delays are working great for the platforms
they support and there's no reason to mess with them, but that doesn't
mean we shouldn't fix the problems that are easy to fix.


-Doug
Russell King (Oracle) Nov. 20, 2017, 6:31 p.m. UTC | #18
On Mon, Nov 20, 2017 at 09:38:46AM -0800, Doug Anderson wrote:
> Hi,
> 
> On Thu, Nov 16, 2017 at 3:22 PM, Russell King - ARM Linux
> <linux@armlinux.org.uk> wrote:
> > On Thu, Nov 16, 2017 at 02:15:02PM -0800, Doug Anderson wrote:
> >> Hi,
> >>
> >> On Thu, Nov 16, 2017 at 1:05 PM, Marc Gonzalez
> >> <marc_gonzalez@sigmadesigns.com> wrote:
> >> > On 16/11/2017 18:05, Russell King - ARM Linux wrote:
> >> >
> >> >> On Thu, Nov 16, 2017 at 05:42:36PM +0100, Marc Gonzalez wrote:
> >> >>
> >> >>> Requesting 100 µs and spinning only 25 µs is still a problem,
> >> >>> don't you agree?
> >> >>
> >> >> Which is why, as I've said *many* times already, drivers are written
> >> >> with leeway on the delays.
> >> >
> >> > A delay 75% too short is possible. Roger that.
> >> >
> >> >> I get the impression that we're just going around in circles, and what
> >> >> you're trying to do is to get me to agree with your point of view.
> >> >> That's not going to happen, because I know the history over about the
> >> >> last /24/ years of kernel development (which is how long I've been
> >> >> involved with the kernel.)  That's almost a quarter of a century!
> >> >>
> >> >> I know how things were done years ago (which is relevant because we
> >> >> still have support in the kernel for these systems), and I also know the
> >> >> history of facilities like cpufreq - I was the one who took the work
> >> >> of Erik Mouw and others involved with the LART project, and turned it
> >> >> into something a little more generic.  The idea of dynamically scaling
> >> >> the CPU frequency on ARM SoCs was something that the SoC manufacturer
> >> >> had not even considered - it was innovative.
> >> >>
> >> >> I know that udelay() can return short delays when used in a kernel with
> >> >> cpufreq enabled, and I also know that's almost an impossible problem to
> >> >> solve without going to a timer-based delay.
> >> >>
> >> >> So, when you think that sending an email about a udelay() that can be
> >> >> 10x shorter might be somehow new information, and might convince people
> >> >> that there's a problem, I'm afraid that it isn't really new information.
> >> >> The SA1110 cpufreq driver is dated 2001, and carries my copyright, and
> >> >> has the ability to make udelay()s 4x shorter or 4x longer depending on
> >> >> the direction of change.
> >> >>
> >> >> We've discussed solutions in the past (probably 10 years ago) about
> >> >> this, and what can be done, and the conclusion to that was, as Nicolas
> >> >> has said, to switch to using a timer-based delay mechanism where
> >> >> possible.  Where this is not possible, the platform is stuck with the
> >> >> loops based delays, and their inherent variability and inaccuracy.
> >> >>
> >> >> These platforms have been tested with such a setup over many years.
> >> >> They work even with udelay() having this behaviour, because it's a
> >> >> known issue and drivers cater for it in ways that I've already covered
> >> >> in my many previous emails to you.
> >> >>
> >> >> These issues are known.  They've been known for the last 15 odd years.
> >> >
> >> > So you've known for umpteen years that fixing loop-based delays is
> >> > intractable, yet you wrote:
> >> >
> >> >> udelay() needs to offer a consistent interface so that drivers know
> >> >> what to expect no matter what the implementation is.  Making one
> >> >> implementation conform to your ideas while leaving the other
> >> >> implementations with other expectations is a recipe for bugs.
> >> >>
> >> >> If you really want to do this, fix the loops_per_jiffy implementation
> >> >> as well so that the consistency is maintained.
> >> >
> >> > In other words, "I'll consider your patch as soon as Hell freezes over".
> >> >
> >> > Roger that. I'll drop the subject then.
> >>
> >> Presumably, though, you could introduce a new API like:
> >>
> >>   udelay_atleast()
> >>
> >> That was guaranteed to delay at least the given number of
> >> microseconds.  Unlike the current udelay(), the new udelay_atleast()
> >> wouldn't really try that hard to get a delay that's approximately the
> >> one requested, it would just guarantee not to ever delay _less_ than
> >> the amount requested.
> >
> > I look forward to reviewing your implementation.
> 
> It's unlikely I'll post a patch in the near term since this isn't
> presenting me with a big problem right now.  Mostly I saw Marc's patch
> and thought it would be a good patch to land and I knew this type of
> thing had bitten me in the past.
> 
> One happy result of this whole discussion, though, is that you now
> sound as if you'll be happy the next time someone brings this up since
> you're looking forward to reviewing an implementation.  That's a nice
> change from the original statement questioning why someone was asking
> about this again.  :)

What I'd be happy with, and what I've always been happy with is what I've
stated: either we fix _all_ implementations or none of them.  We can't
have the situation where some implementations give one expectation and
others give something completely different.

That's always been my argument against _just_ fixing the timer-based
delays and ignoring the rest of the problem.

Nothing has changed about my position.
Pavel Machek Dec. 7, 2017, 12:43 p.m. UTC | #19
> On Wed, Nov 15, 2017 at 01:51:54PM +0100, Marc Gonzalez wrote:
> > On 01/11/2017 20:38, Marc Gonzalez wrote:
> > 
> > > OK, I'll just send my patch, and then crawl back under my rock.
> > 
> > Linus,
> > 
> > As promised, the patch is provided below. And as promised, I will
> > no longer bring this up on LKML.
> > 
> > FWIW, I have checked that the computed value matches the expected
> > value for all HZ and delay_us, and for a few clock frequencies,
> > using the following program:
> > 
> > $ cat delays.c
> > #include <stdio.h>
> > #define MEGA 1000000u
> > typedef unsigned int uint;
> > typedef unsigned long long u64;
> > #define DIV_ROUND_UP(n,d) (((n) + (d) - 1) / (d))
> > 
> > static const uint HZ_tab[] = { 100, 250, 300, 1000 };
> > 
> > static void check_cycle_count(uint freq, uint HZ, uint delay_us)
> > {
> > 	uint UDELAY_MULT = (2147 * HZ) + (483648 * HZ / MEGA);
> > 	uint lpj = DIV_ROUND_UP(freq, HZ);
> > 	uint computed = ((u64)lpj * delay_us * UDELAY_MULT >> 31) + 1;
> > 	uint expected = DIV_ROUND_UP((u64)delay_us * freq, MEGA);
> > 
> > 	if (computed != expected)
> > 		printf("freq=%u HZ=%u delay_us=%u comp=%u exp=%u\n", freq, HZ, delay_us, computed, expected);
> > }
> > 
> > int main(void)
> > {
> > 	uint idx, delay_us, freq;
> > 
> > 	for (freq = 3*MEGA; freq <= 100*MEGA; freq += 3*MEGA)
> > 		for (idx = 0; idx < sizeof HZ_tab / sizeof *HZ_tab; ++idx)
> > 			for (delay_us = 1; delay_us <= 2000; ++delay_us)
> > 				check_cycle_count(freq, HZ_tab[idx], delay_us);
> > 
> > 	return 0;
> > }
> > 
> > 
> > 
> > -- >8 --
> > Subject: [PATCH] ARM: Tweak clock-based udelay implementation
> > 
> > In 9f8197980d87a ("delay: Add explanation of udelay() inaccuracy")
> > Russell pointed out that loop-based delays may return early.
> > 
> > On the arm platform, delays may be either loop-based or clock-based.
> > 
> > This patch tweaks the clock-based implementation so that udelay(N)
> > is guaranteed to spin at least N microseconds.
> 
> As I've already said, I don't want this, because it encourages people
> to use too-small delays in driver code, and if we merge it then you
> will look at your data sheet, decide it says "you need to wait 10us"
> and write in your driver "udelay(10)" which will break on the loops
> based delay.
> 
> udelay() needs to offer a consistent interface so that drivers know
> what to expect no matter what the implementation is.  Making one
> implementation conform to your ideas while leaving the other
> implementations with other expectations is a recipe for bugs.

udelay() needs to be consistent across platforms, and yes, udelay(10)
is expected to delay at least 10usec.

If that is not true on your platform, _fix your platform_. But it is
not valid to reject patches fixing other platforms, just because your
platform is broken.

									Pavel

Patch

diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
index 2cef11884857..0a25712077ec 100644
--- a/arch/arm/lib/delay.c
+++ b/arch/arm/lib/delay.c
@@ -58,15 +58,15 @@  static void __timer_delay(unsigned long cycles)
 {
 	cycles_t start = get_cycles();
 
-	while ((get_cycles() - start) < cycles)
+	while ((get_cycles() - start) <= cycles)
 		cpu_relax();
 }
 
 static void __timer_const_udelay(unsigned long xloops)
 {
-	unsigned long long loops = xloops;
-	loops *= arm_delay_ops.ticks_per_jiffy;
-	__timer_delay(loops >> UDELAY_SHIFT);
+	u64 tmp = (u64)xloops * arm_delay_ops.ticks_per_jiffy;
+	unsigned long cycles = tmp >> UDELAY_SHIFT;
+	__timer_delay(cycles + 1); /* Round up in 1 instruction */
 }
 
 static void __timer_udelay(unsigned long usecs)
@@ -92,7 +92,7 @@  void __init register_current_timer_delay(const struct delay_timer *timer)
 	if (!delay_calibrated && (!delay_res || (res < delay_res))) {
 		pr_info("Switching to timer-based delay loop, resolution %lluns\n", res);
 		delay_timer			= timer;
-		lpj_fine			= timer->freq / HZ;
+		lpj_fine			= DIV_ROUND_UP(timer->freq, HZ);
 		delay_res			= res;
 
 		/* cpufreq may scale loops_per_jiffy, so keep a private copy */
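
The arithmetic effect of the two rounding changes above can be checked entirely in user space, mirroring Marc's delays.c from the cover letter (UDELAY_SHIFT is assumed to be 31, as implied by the `>> 31` in that program): rounding lpj up plus adding one cycle reproduces the exact round-up of delay_us * freq / 1e6.

```c
#include <assert.h>

#define MEGA 1000000u
#define UDELAY_SHIFT 31  /* assumed, matching the >> 31 in delays.c */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

typedef unsigned int uint;
typedef unsigned long long u64;

/* Cycle count requested by the patched __timer_const_udelay():
 * lpj rounded up (register_current_timer_delay hunk) plus one
 * extra cycle (__timer_const_udelay hunk). */
static uint patched_cycles(uint freq, uint hz, uint delay_us)
{
    uint udelay_mult = (2147 * hz) + (483648 * hz / MEGA);
    uint lpj = DIV_ROUND_UP(freq, hz);

    return (uint)(((u64)lpj * delay_us * udelay_mult) >> UDELAY_SHIFT) + 1;
}

/* Exact number of timer cycles needed to span delay_us microseconds. */
static uint exact_cycles(uint freq, uint delay_us)
{
    return (uint)DIV_ROUND_UP((u64)delay_us * freq, MEGA);
}
```

Spot checks across HZ values and frequencies (a subset of the exhaustive sweep in the cover letter) show the two agree, i.e. the patched code spins for at least the requested time without over-delaying by more than one cycle.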