Message ID | 5527B331.5000205@free.fr (mailing list archive)
---|---
State | New, archived
Hi,

On Fri, Apr 10, 2015 at 01:25:37PM +0200, Mason wrote:
> Hello everyone,
>
> This is take 2 of my tiny delay.c patch
>
> Problem statement
>
> When converting microseconds to timer cycles in __timer_udelay() and
> __timer_const_udelay(), the result is rounded down(*), which means the
> system will not spin as long as requested (specifically, between
> epsilon and 1 cycle shorter).
>
> If I understand correctly, most drivers expect udelay(N) to spin for
> at least N µs. Is that correct? In that use case, spinning less might
> introduce subtle heisenbugs.
>
> Typical example
>
> timer->freq = 90 kHz && HZ = 100
> (thus UDELAY_MULT = 107374 && ticks_per_jiffy = 900)
>
> udelay(10) => __timer_const_udelay(10*107374)
>            => __timer_delay((1073740*900) >> 30)
>            => __timer_delay(0)
>
> So udelay(10) resolves to no delay at all.
>
> (*) 2^41 / 10^6 = 2199023.255552
>     2199023 < 2^41 / 10^6
>     UDELAY_MULT = 2199023*HZ / 2^11 < 2^30*HZ / 10^6
>
>     cycles = N * UDELAY_MULT * freq/HZ / 2^30
>            < N * 2^30*HZ / 10^6 * freq/HZ / 2^30
>            < N / 10^6 * freq
>
> Proposed fix
>
> Since results are always rounded down, all we need is to increment
> the result by 1 to round it up.
>
> Would someone ACK the patch below?
>
> Regards.
>
> Patch against 4.0-rc4
>
> diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
> index 312d43e..3cfbd07 100644
> --- a/arch/arm/lib/delay.c
> +++ b/arch/arm/lib/delay.c
> @@ -66,7 +66,7 @@ static void __timer_const_udelay(unsigned long xloops)
>  {
>  	unsigned long long loops = xloops;
>  	loops *= arm_delay_ops.ticks_per_jiffy;
> -	__timer_delay(loops >> UDELAY_SHIFT);
> +	__timer_delay((loops >> UDELAY_SHIFT) + 1);
>  }

If loops is a multiple of 2^UDELAY_SHIFT, then your result is too
high by one. The proper way to round by excess is the following:

	__timer_delay((loops + (1 << UDELAY_SHIFT) - 1) >> UDELAY_SHIFT);

That way it does +1 for every value of loops not an exact multiple
of 2^UDELAY_SHIFT.

Regards,
willy
On Fri, Apr 10, 2015 at 01:25:37PM +0200, Mason wrote:
> If I understand correctly, most drivers expect udelay(N) to spin for
> at least N µs. Is that correct? In that use case, spinning less might
> introduce subtle heisenbugs.

We've never guaranteed this.

The fact is that udelay() can delay for _approximately_ the time you
ask for - it might be slightly shorter, or it could be much longer
than you expect. On most UP implementations using the software loop
it will typically be around 1% slower than requested.

Adding 1us to every delay is going to be very bad. Rather than doing
that, why not arrange for the rounding error to be accommodated?

> Typical example
>
> timer->freq = 90 kHz && HZ = 100
> (thus UDELAY_MULT = 107374 && ticks_per_jiffy = 900)
>
> udelay(10) => __timer_const_udelay(10*107374)
>            => __timer_delay((1073740*900) >> 30)
>            => __timer_delay(0)

In other words, add (1 << 30) - 1 before shifting right by 30.
On 10/04/2015 13:44, Russell King - ARM Linux wrote:
> On Fri, Apr 10, 2015 at 01:25:37PM +0200, Mason wrote:
>> If I understand correctly, most drivers expect udelay(N) to spin for
>> at least N µs. Is that correct? In that use case, spinning less might
>> introduce subtle heisenbugs.
>
> We've never guaranteed this.
>
> The fact is that udelay() can delay for _approximately_ the time you
> ask for - it might be slightly shorter, or it could be much longer
> than you expect.

OK, but asking for 10 µs and spinning 0 is a problem, right?

> On most UP implementations using the software loop
> it will typically be around 1% slower than requested.

Please correct any misconception; it seems to me that typical driver
writers, reading "operation X takes 10 µs to complete", will write
udelay(10); (if they want to spin-wait)

Do you think they should take the inherent imprecision of loop-based
delays into account, and add a small cushion to be safe?

Also, it seems timer-based delays were introduced, in part, because
they did away with the imprecision of loop-based delays on DVFS systems.

> Adding 1us to every delay is going to be very bad. Rather than doing
> that, why not arrange for the rounding error to be accommodated?

Are you saying that my proposed patch blindly adds a 1 µs delay?
In fact, it adds one /timer cycle/ (which, in my 90 kHz example, is
much more than 1 µs), and this increment is just rounding up the
conversion result. So if a user wants to spin N µs, and N µs is
equivalent to X.7 cycles, then we delay X+1 cycles.

Regards.
Hello Willy,

On 10/04/2015 13:42, Willy Tarreau wrote:
> Hi,
>
> On Fri, Apr 10, 2015 at 01:25:37PM +0200, Mason wrote:
>> [...]
>>
>> Proposed fix
>>
>> Since results are always rounded down, all we need is to increment
>> the result by 1 to round it up.
>>
>> Would someone ACK the patch below?
>>
>> [...]
>>
>> -	__timer_delay(loops >> UDELAY_SHIFT);
>> +	__timer_delay((loops >> UDELAY_SHIFT) + 1);
>>  }
>
> If loops is a multiple of 2^UDELAY_SHIFT, then your result is too
> high by one. The proper way to round by excess is the following:
>
> 	__timer_delay((loops + (1 << UDELAY_SHIFT) - 1) >> UDELAY_SHIFT);
>
> That way it does +1 for every value of loops not an exact multiple
> of 2^UDELAY_SHIFT.

The important thing to realize is that xloops is already rounded down,
because we use 2199023 as an approximation of 2^41 / 10^6.

Thus, even when 'loops' is a multiple of 2^30, we'll want to round up.

Illustration:

	timer->freq = 100*2^20 && HZ = 100
	(thus UDELAY_MULT = 107374 && ticks_per_jiffy = 2^20)

Suppose udelay(512); we want to spin for 512 / 10^6 * 100*2^20 =
53687.0912 cycles, i.e. 53688 cycles if we round up.

	loops = 512 * 107374 * 2^20 = 53687 * 2^30

If we just add (2^30 - 1) before shifting by 30, the result comes out
to 53687, but (loops >> 30) + 1 is closer to what we really want.

One might argue that the difference between 53687 and 53688 is lost
in the noise and thus irrelevant. I can agree with that. Which is why
I chose the simpler

	__timer_delay((loops >> UDELAY_SHIFT) + 1);

over

	__timer_delay((loops + (1 << UDELAY_SHIFT) - 1) >> UDELAY_SHIFT);

Do you disagree with this logic?

Regards.
On Fri, Apr 10, 2015 at 02:41:37PM +0200, Mason wrote:
> On 10/04/2015 13:44, Russell King - ARM Linux wrote:
>> On Fri, Apr 10, 2015 at 01:25:37PM +0200, Mason wrote:
>>> If I understand correctly, most drivers expect udelay(N) to spin for
>>> at least N µs. Is that correct? In that use case, spinning less might
>>> introduce subtle heisenbugs.
>>
>> We've never guaranteed this.
>>
>> The fact is that udelay() can delay for _approximately_ the time you
>> ask for - it might be slightly shorter, or it could be much longer
>> than you expect.
>
> OK, but asking for 10 µs and spinning 0 is a problem, right?

Potentially.

>> On most UP implementations using the software loop
>> it will typically be around 1% slower than requested.
>
> Please correct any misconception, it seems to me that typical
> driver writers, reading "operation X takes 10 µs to complete"
> will write udelay(10); (if they want to spin-wait)

Arguments do not matter here, sorry. What I said above is the way it
behaves, period. It's not for discussion.

It may interest you that I discussed this issue with Linus, and Linus
also said that it _isn't_ a problem and it _isn't_ something we care
to fix.

So, like it or not, we're stuck with udelay(10) being possible to
delay by 9.99us instead of 10us.

Where the inaccuracy comes from is entirely down to the way we calculate
loops_per_jiffy - this is the number of loop cycles between two timer
interrupts - but this does _not_ take account of the time to execute a
timer interrupt.

So, loops_per_jiffy is the number of loop cycles between two timer
interrupts minus the time taken to service the timer interrupt.

And what this means is that udelay(n) where 'n' is less than the period
between two timer interrupts /will/ be, and is /expected to be/
potentially shorter than the requested period.

There's no getting away from that: we can't estimate how long the timer
interrupt takes to handle without the use of an external timer, and if
we've got an external timer, we might as well use it for all delays.

> Do you think they should take the inherent imprecision of loop-based
> delays into account, and add a small cushion to be safe?

No. See above. Not doing that. Live with it.

Fix the rounding errors if you wish, but do _not_ try to fix the
"udelay() may return slightly earlier than requested" problem. I will
not apply a patch to fix _that_ problem.
Hi Mason,

On Fri, Apr 10, 2015 at 04:53:10PM +0200, Mason wrote:
> The important thing to realize is that xloops is already rounded down,
> because we use 2199023 as an approximation of 2^41 / 10^6.
>
> Thus, even when 'loops' is a multiple of 2^30, we'll want to round up.
>
> Illustration:
>
> 	timer->freq = 100*2^20 && HZ = 100
> 	(thus UDELAY_MULT = 107374 && ticks_per_jiffy = 2^20)
>
> Suppose udelay(512); we want to spin for 512 / 10^6 * 100*2^20 =
> 53687.0912 cycles, i.e. 53688 cycles if we round up.
>
> 	loops = 512 * 107374 * 2^20 = 53687 * 2^30
>
> If we just add (2^30 - 1) before shifting by 30, the result comes out
> to 53687, but (loops >> 30) + 1 is closer to what we really want.
>
> One might argue that the difference between 53687 and 53688 is
> lost in the noise and thus irrelevant. I can agree with that.
>
> Which is why I chose the simpler
>
> 	__timer_delay((loops >> UDELAY_SHIFT) + 1);
>
> over
>
> 	__timer_delay((loops + (1 << UDELAY_SHIFT) - 1) >> UDELAY_SHIFT);
>
> Do you disagree with this logic?

OK, it seems to make sense then.

Regards,
Willy
Hello Russell,

I appreciate (very much so) that you spend time replying to me, but I
also sense a lot of animosity, and I don't know what I've done wrong
to deserve it :-(

On 10/04/2015 17:06, Russell King - ARM Linux wrote:
> On Fri, Apr 10, 2015 at 02:41:37PM +0200, Mason wrote:
>> OK, but asking for 10 µs and spinning 0 is a problem, right?
>
> Potentially.
>
> [...]
>
> It may interest you that I discussed this issue with Linus, and Linus
> also said that it _isn't_ a problem and it _isn't_ something we care
> to fix.
>
> So, like it or not, we're stuck with udelay(10) being possible to
> delay by 9.99us instead of 10us.
>
> [...]
>
> And what this means is that udelay(n) where 'n' is less than the
> period between two timer interrupts /will/ be, and is /expected to
> be/ potentially shorter than the requested period.

You've made it clear how loop-based delays are implemented, and also
that loop-based delays are typically 1% shorter than requested.
(Thanks for the overview, by the way.) Please note that I haven't
touched the loop-based code; I'm only discussing the timer-based code.

> There's no getting away from that, we can't estimate how long the timer
> interrupt takes to handle without the use of an external timer, and if
> we've got an external timer, we might as well use it for all delays.

Exactly! And my patch only changes __timer_const_udelay(), so again I'm
not touching loop-based code.

>> Do you think they should take the inherent imprecision of loop-based
>> delays into account, and add a small cushion to be safe?
>
> No. See above. Not doing that. Live with it.

See, I don't understand your "Not doing that" statement. I didn't
suggest changing the implementation; I asked your opinion (and the
opinion of others on the list) whether driver _writers_ should take
into account _in their code_ the 1% error you mentioned.

Specifically, should a driver writer use

	udelay(101);

when his spec says to spin 100 µs?

(Anyway, this is just a tangential question, as I digest the ins and
outs of kernel and driver development.)

> Fix the rounding errors if you wish, but do _not_ try to fix the
> "udelay() may return slightly earlier than requested" problem. I
> will not apply a patch to fix _that_ problem.

Can I submit as-is the one-line patch I proposed earlier?

Regards.
On Fri, Apr 10, 2015 at 05:30:24PM +0200, Mason wrote:
> I appreciate (very much so) that you spend time replying to me,
> but I also sense a lot of animosity, and I don't know what I've
> done wrong to deserve it :-(

I'm putting the point across strongly because I really don't think
there is an issue to be solved here.

> On 10/04/2015 17:06, Russell King - ARM Linux wrote:
>> And what this means is that udelay(n) where 'n' is less than the
>> period between two timer interrupts /will/ be, and is /expected to
>> be/ potentially shorter than the requested period.
>
> You've made it clear how loop-based delays are implemented, and also
> that loop-based delays are typically 1% shorter than requested.
> (Thanks for the overview, by the way.) Please note that I haven't
> touched the loop-based code; I'm only discussing the timer-based code.

1% is a figure I pulled out of the air. It really depends on the CPU
instructions per cycle, and how much work is being done in the timer
interrupt handler.

>> There's no getting away from that, we can't estimate how long the timer
>> interrupt takes to handle without the use of an external timer, and if
>> we've got an external timer, we might as well use it for all delays.
>
> Exactly! And my patch only changes __timer_const_udelay(), so again I'm
> not touching loop-based code.

What I'm trying to get through to you is that udelay() as a _whole_
does not provide a guarantee that it will wait for _at least_ the time
you asked for. All that it does is provide an _approximate_ delay.

Yes, we can improve the timer delays to provide a guaranteed delay of
at least the requested period. We _can't_ do the same for the
loop-based delay.

So, if we fix the timer based delays, udelay() will then have two
differing expectations depending on whether it's using a timer or a
loop based delay. That is bad. It's extremely poor. The expectations
of an API should not change because of a different implementation;
that's the way bugs happen.

That's why...

>> No. See above. Not doing that. Live with it.

...it's not a problem, and why we're not going to fix the timer code
to provide a minimum guaranteed delay.

> Specifically, should a driver writer use
>
> 	udelay(101);
>
> when his spec says to spin 100 µs?
>
> (Anyway, this is just a tangential question, as I digest the ins
> and outs of kernel and driver development.)

A driver writer should always use the required delay plus an adequate
cushion to ensure that the delay required by the hardware is met.

I suggest you read this:

	https://lkml.org/lkml/2011/1/9/37

which is the discussion I had with Linus on the point you are raising
here; it is the 4th hit in Google for "udelay shorter delays".

Given that some udelay() implementations have been known to be as much
as 50% off, that suggests using a delay value of 200 in your above
example for a delay of 100µs. ARM is /relatively/ good in that regard,
cpufreq and scheduling effects notwithstanding.
diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
index 312d43e..3cfbd07 100644
--- a/arch/arm/lib/delay.c
+++ b/arch/arm/lib/delay.c
@@ -66,7 +66,7 @@ static void __timer_const_udelay(unsigned long xloops)
 {
 	unsigned long long loops = xloops;
 	loops *= arm_delay_ops.ticks_per_jiffy;
-	__timer_delay(loops >> UDELAY_SHIFT);
+	__timer_delay((loops >> UDELAY_SHIFT) + 1);
 }
 
 static void __timer_udelay(unsigned long usecs)