Message ID | 20240329091245.135216-1-richard120310@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | blk-wbt: Speed up integer square root in rwb_arm_timer | expand |
On 3/29/24 2:12 AM, I Hsin Cheng wrote: > As the result shown, the origin version of integer square root, which is > "int_sqrt" takes 35.37 msec task-clock, 1,2181,3348 cycles, 1,6095,3665 > instructions, 2551,2990 branches and causes 1,0616 branch-misses. > > At the same time, the variant version of integer square root, which is > "int_fastsqrt" takes 33.96 msec task-clock, 1,1645,7487 cyclces, > 5621,0086 instructions, 321,0409 branches and causes 2407 branch-misses. > We can clearly see that "int_fastsqrt" performs faster and better result > so it's indeed a faster invariant of integer square root. I'm not sure that a 4% performance improvement is sufficient to replace the int_sqrt() implementation. Additionally, why to add a second implementation of int_sqrt() instead of replacing the int_sqrt() implementation in lib/math/int_sqrt.c? > The experiments runs on x86_64 GNU/Linux Architecture and the CPU is > Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz. Since int_sqrt() does not use divisions and since int_fastsqrt() uses divisions, can all CPUs supported by the Linux kernel divide numbers as quickly as the CPU mentioned above? Thanks, Bart.
On 3/29/24 12:15 PM, Bart Van Assche wrote: > On 3/29/24 2:12 AM, I Hsin Cheng wrote: >> As the result shown, the origin version of integer square root, which is >> "int_sqrt" takes 35.37 msec task-clock, 1,2181,3348 cycles, 1,6095,3665 >> instructions, 2551,2990 branches and causes 1,0616 branch-misses. >> >> At the same time, the variant version of integer square root, which is >> "int_fastsqrt" takes 33.96 msec task-clock, 1,1645,7487 cyclces, >> 5621,0086 instructions, 321,0409 branches and causes 2407 branch-misses. >> We can clearly see that "int_fastsqrt" performs faster and better result >> so it's indeed a faster invariant of integer square root. > > I'm not sure that a 4% performance improvement is sufficient to > replace the int_sqrt() implementation. Additionally, why to add a > second implementation of int_sqrt() instead of replacing the > int_sqrt() implementation in lib/math/int_sqrt.c? That's the real question imho - if provides the same numbers and is faster, why have two? I ran a quick test because I was curious, and the precision is definitely worse. The claim that it is floor(sqrt(val)) is not true. Trivial example: 1005117225 sqrt() 31703.58 int_sqrt() 30703 int_fastsqrt() 30821 whether this matters, probably not, but then again it's hard to care about a slow path sqrt calculation. I'd certainly err on the side of precision for that.
The last email didn't follow the plain-text format, I'm sorry for that, that's why I resend it. Sorry for the bother. >> Additionally, why to add a second >> implementation of int_sqrt() instead of replacing the int_sqrt() >> implementation in lib/math/int_sqrt.c? I was thinking about adding an alternative option first rather than replace the whole int_sqrt() function which is used in many other parts in the Linux kernel. >> Since int_sqrt() does not use divisions and since int_fastsqrt() uses >> divisions, can all CPUs supported by the Linux kernel divide numbers as >> quickly as the CPU mentioned above? You're right about that. Thanks for pointing out the problem, I'll try to replace the divisions maybe with another kind of approximation method. > The claim that it is floor(sqrt(val)) is not true. > Trivial example: > > 1005117225 > sqrt() 31703.58 > int_sqrt() 30703 > int_fastsqrt() 30821 Thanks for pointing out the problem, I only compare my method with int_sqrt() and plot the result using gnuplot, the result shown that they gave very very close answers, but I didn't count the error based on the integer part of sqrt(), which is indeed necessary. Sorry for this part. I'll check on the precision of my method again. Thanks for your patience and time on reviewing my patch. Best Regards, I Hsin Cheng. On Sat, Mar 30, 2024 at 4:29 PM 鄭以新 <richard120310@gmail.com> wrote: > > >> Additionally, why to add a second > >> implementation of int_sqrt() instead of replacing the int_sqrt() > >> implementation in lib/math/int_sqrt.c? > > I was thinking about adding an alternative option first rather than > replace the whole int_sqrt() function which is used in many other > parts in the Linux kernel. > > >> Since int_sqrt() does not use divisions and since int_fastsqrt() uses > >> divisions, can all CPUs supported by the Linux kernel divide numbers as > >> quickly as the CPU mentioned above? > > You're right about that. Thanks for pointing out the problem, I'll try to > replace the divisions maybe with another kind of approximation method. > > > The claim that it is floor(sqrt(val)) is not true. > > Trivial example: > > > > 1005117225 > > sqrt() 31703.58 > > int_sqrt() 30703 > > int_fastsqrt() 30821 > > Thanks for pointing out the problem, I only compare my method with int_sqrt() > and plot the result using gnuplot, the result shown that they gave very very close > answers, but I didn't count the error based on the integer part of sqrt(), which is > indeed necessary. Sorry for this part. I'll check on the precision of my method > again. > > Thanks for your patience and time on reviewing my patch. > > Best Regards, > > I Hsin Cheng. > > Jens Axboe <axboe@kernel.dk> 於 2024年3月30日 週六 上午3:12寫道: >> >> On 3/29/24 12:15 PM, Bart Van Assche wrote: >> > On 3/29/24 2:12 AM, I Hsin Cheng wrote: >> >> As the result shown, the origin version of integer square root, which is >> >> "int_sqrt" takes 35.37 msec task-clock, 1,2181,3348 cycles, 1,6095,3665 >> >> instructions, 2551,2990 branches and causes 1,0616 branch-misses. >> >> >> >> At the same time, the variant version of integer square root, which is >> >> "int_fastsqrt" takes 33.96 msec task-clock, 1,1645,7487 cyclces, >> >> 5621,0086 instructions, 321,0409 branches and causes 2407 branch-misses. >> >> We can clearly see that "int_fastsqrt" performs faster and better result >> >> so it's indeed a faster invariant of integer square root. >> > >> > I'm not sure that a 4% performance improvement is sufficient to >> > replace the int_sqrt() implementation. Additionally, why to add a >> > second implementation of int_sqrt() instead of replacing the >> > int_sqrt() implementation in lib/math/int_sqrt.c? >> >> That's the real question imho - if provides the same numbers and is >> faster, why have two? >> >> I ran a quick test because I was curious, and the precision is >> definitely worse. The claim that it is floor(sqrt(val)) is not true. >> Trivial example: >> >> 1005117225 >> sqrt() 31703.58 >> int_sqrt() 30703 >> int_fastsqrt() 30821 >> >> whether this matters, probably not, but then again it's hard to care >> about a slow path sqrt calculation. I'd certainly err on the side of >> precision for that. >> >> -- >> Jens Axboe >>
diff --git a/block/blk-wbt.c b/block/blk-wbt.c index 0bb613139..8d25c0e55 100644 --- a/block/blk-wbt.c +++ b/block/blk-wbt.c @@ -407,7 +407,7 @@ static void rwb_arm_timer(struct rq_wb *rwb) * though. */ rwb->cur_win_nsec = div_u64(rwb->win_nsec << 4, - int_sqrt((rqd->scale_step + 1) << 8)); + int_fastsqrt((rqd->scale_step + 1) << 8)); } else { /* * For step < 0, we don't want to increase/decrease the diff --git a/lib/math/int_sqrt.c b/lib/math/int_sqrt.c index a8170bb91..e25ba179e 100644 --- a/lib/math/int_sqrt.c +++ b/lib/math/int_sqrt.c @@ -40,6 +40,27 @@ unsigned long int_sqrt(unsigned long x) } EXPORT_SYMBOL(int_sqrt); + +/** + * int_fastsqrt - faster invariant of int_sqrt + * @x: integer of which to calculate the sqrt + * + * Computes: floor(sqrt(x)) + */ + +unsigned long int_fastsqrt(unsigned long x) +{ + unsigned long y = ((31 - __builtin_clz(x | 1)) >> 1); + unsigned long z = 1 << y; + + z = (z + (x / z)) >> 1; + z = (z + (x / z)) >> 1; + z = (z + (x / z)) >> 1; + + return z; +} +EXPORT_SYMBOL(int_fastsqrt); + #if BITS_PER_LONG < 64 /** * int_sqrt64 - strongly typed int_sqrt function when minimum 64 bit input
As the comments in rwb_arm_timer says, we should speed up the process of integer square root with some variant versions. The implementation of the variant integer square root "int_fast_sqrt()" is based on the equation $lg(\sqrt{x}) = lg(x) / 2$ , where "lg" stands for logarithm with base 2. After we take the first approximation by calculate $2^{lg(x)/2}$ , we utilize Newton's method for three times in order to enhance the precision. To prove "int_fastsqrt" is indeed faster than the "int_sqrt", we take some bench experiments to calculate the integer logarithm from 1 to 10^6 and use "perf stat --repeat 100" to measure the performance of each function. As the result shown, the origin version of integer square root, which is "int_sqrt" takes 35.37 msec task-clock, 1,2181,3348 cycles, 1,6095,3665 instructions, 2551,2990 branches and causes 1,0616 branch-misses. At the same time, the variant version of integer square root, which is "int_fastsqrt" takes 33.96 msec task-clock, 1,1645,7487 cyclces, 5621,0086 instructions, 321,0409 branches and causes 2407 branch-misses. We can clearly see that "int_fastsqrt" performs faster and better result so it's indeed a faster invariant of integer square root. The experiments runs on x86_64 GNU/Linux Architecture and the CPU is Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz. Signed-off-by: I Hsin Cheng <richard120310@gmail.com> --- block/blk-wbt.c | 2 +- lib/math/int_sqrt.c | 21 +++++++++++++++++++++ 2 files changed, 22 insertions(+), 1 deletion(-)