Message ID | cover.1579196675.git.christophe.leroy@c-s.fr (mailing list archive) |
---|---|
Headers | show |
Series | powerpc: switch VDSO to C implementation. | expand |
Hi! On Thu, Jan 16, 2020 at 05:58:24PM +0000, Christophe Leroy wrote: > On a powerpc8xx, with current powerpc/32 ASM VDSO: > > gettimeofday: vdso: 907 nsec/call > clock-getres-realtime: vdso: 484 nsec/call > clock-gettime-realtime: vdso: 899 nsec/call > > The first patch adds VDSO generic C support without any changes to common code. > Performance is as follows: > > gettimeofday: vdso: 1211 nsec/call > clock-getres-realtime: vdso: 722 nsec/call > clock-gettime-realtime: vdso: 1216 nsec/call > > Then a few changes in the common code have allowed performance improvement. At > the end of the series we have: > > gettimeofday: vdso: 974 nsec/call > clock-getres-realtime: vdso: 545 nsec/call > clock-gettime-realtime: vdso: 941 nsec/call > > The final result is rather close to pure ASM VDSO: > * 7% more on gettimeofday (9 cycles) > * 5% more on clock-gettime-realtime (6 cycles) > * 12% more on clock-getres-realtime (8 cycles) Nice! Much better. It should be tested on more representative hardware, too, but this looks promising alright :-) Segher
Le 17/01/2020 à 09:58, Segher Boessenkool a écrit : > Hi! > > On Thu, Jan 16, 2020 at 05:58:24PM +0000, Christophe Leroy wrote: >> On a powerpc8xx, with current powerpc/32 ASM VDSO: >> >> gettimeofday: vdso: 907 nsec/call >> clock-getres-realtime: vdso: 484 nsec/call >> clock-gettime-realtime: vdso: 899 nsec/call >> >> The first patch adds VDSO generic C support without any changes to common code. >> Performance is as follows: >> >> gettimeofday: vdso: 1211 nsec/call >> clock-getres-realtime: vdso: 722 nsec/call >> clock-gettime-realtime: vdso: 1216 nsec/call >> >> Then a few changes in the common code have allowed performance improvement. At >> the end of the series we have: >> >> gettimeofday: vdso: 974 nsec/call >> clock-getres-realtime: vdso: 545 nsec/call >> clock-gettime-realtime: vdso: 941 nsec/call >> >> The final result is rather close to pure ASM VDSO: >> * 7% more on gettimeofday (9 cycles) >> * 5% more on clock-gettime-realtime (6 cycles) >> * 12% more on clock-getres-realtime (8 cycles) > > Nice! Much better. > > It should be tested on more representative hardware, too, but this looks > promising alright :-) > Yes. Now the challenge is to get VDSO32 buildable on PPC64. The big issue is that in most powerpc/include/asm/*.h , CONFIG_PPC64 is used to know if the build is a 64 bits build or a 32 bits build, so VDSO32 build fails. I don't know how this could be easily fixed. Christophe
Hi On 01/17/2020 08:58 AM, Segher Boessenkool wrote: > Hi! > > On Thu, Jan 16, 2020 at 05:58:24PM +0000, Christophe Leroy wrote: >> On a powerpc8xx, with current powerpc/32 ASM VDSO: >> >> gettimeofday: vdso: 907 nsec/call >> clock-getres-realtime: vdso: 484 nsec/call >> clock-gettime-realtime: vdso: 899 nsec/call >> >> The first patch adds VDSO generic C support without any changes to common code. >> Performance is as follows: >> >> gettimeofday: vdso: 1211 nsec/call >> clock-getres-realtime: vdso: 722 nsec/call >> clock-gettime-realtime: vdso: 1216 nsec/call >> >> Then a few changes in the common code have allowed performance improvement. At >> the end of the series we have: >> >> gettimeofday: vdso: 974 nsec/call >> clock-getres-realtime: vdso: 545 nsec/call >> clock-gettime-realtime: vdso: 941 nsec/call >> >> The final result is rather close to pure ASM VDSO: >> * 7% more on gettimeofday (9 cycles) >> * 5% more on clock-gettime-realtime (6 cycles) >> * 12% more on clock-getres-realtime (8 cycles) > > Nice! Much better. > > It should be tested on more representative hardware, too, but this looks > promising alright :-) > mpc832x (e300c2 core) at 333 MHz: Before: gettimeofday: vdso: 235 nsec/call clock-getres-realtime-coarse: vdso: 1668 nsec/call clock-gettime-realtime-coarse: vdso: 1338 nsec/call clock-getres-realtime: vdso: 135 nsec/call clock-gettime-realtime: vdso: 244 nsec/call clock-getres-boottime: vdso: 1232 nsec/call clock-gettime-boottime: vdso: 1935 nsec/call clock-getres-tai: vdso: 1257 nsec/call clock-gettime-tai: vdso: 1898 nsec/call clock-getres-monotonic-raw: vdso: 1229 nsec/call clock-gettime-monotonic-raw: vdso: 1541 nsec/call clock-getres-monotonic-coarse: vdso: 1699 nsec/call clock-gettime-monotonic-coarse: vdso: 1477 nsec/call clock-getres-monotonic: vdso: 135 nsec/call clock-gettime-monotonic: vdso: 283 nsec/call With the series: gettimeofday: vdso: 271 nsec/call clock-getres-realtime-coarse: vdso: 159 nsec/call clock-gettime-realtime-coarse: vdso: 184 nsec/call clock-getres-realtime: vdso: 163 nsec/call clock-gettime-realtime: vdso: 281 nsec/call clock-getres-boottime: vdso: 169 nsec/call clock-gettime-boottime: vdso: 274 nsec/call clock-getres-tai: vdso: 163 nsec/call clock-gettime-tai: vdso: 277 nsec/call clock-getres-monotonic-raw: vdso: 166 nsec/call clock-gettime-monotonic-raw: vdso: 302 nsec/call clock-getres-monotonic-coarse: vdso: 159 nsec/call clock-gettime-monotonic-coarse: vdso: 184 nsec/call clock-getres-monotonic: vdso: 166 nsec/call clock-gettime-monotonic: vdso: 274 nsec/call Christophe
On Mon, Jan 20, 2020 at 02:56:00PM +0000, Christophe Leroy wrote: > >Nice! Much better. > > > >It should be tested on more representative hardware, too, but this looks > >promising alright :-) > > mpc832x (e300c2 core) at 333 MHz: > > Before: > > gettimeofday: vdso: 235 nsec/call > clock-gettime-realtime: vdso: 244 nsec/call > > With the series: > > gettimeofday: vdso: 271 nsec/call > clock-gettime-realtime: vdso: 281 nsec/call Those are important, and degrade ~15%. That is acceptable IMO, but do you see a way to optimise this (later)? Anyway, excellent results, thanks for your persistence! Segher
Le 20/01/2020 à 16:19, Segher Boessenkool a écrit : > On Mon, Jan 20, 2020 at 02:56:00PM +0000, Christophe Leroy wrote: >>> Nice! Much better. >>> >>> It should be tested on more representative hardware, too, but this looks >>> promising alright :-) >> >> mpc832x (e300c2 core) at 333 MHz: >> >> Before: >> >> gettimeofday: vdso: 235 nsec/call >> clock-gettime-realtime: vdso: 244 nsec/call >> >> With the series: >> >> gettimeofday: vdso: 271 nsec/call >> clock-gettime-realtime: vdso: 281 nsec/call > > Those are important, and degrade ~15%. That is acceptable IMO, but do > you see a way to optimise this (later)? Not easy I think. First we have the unavoidable ASM entry function that can't be dropped because of the CR[SO] bit the set on error or clear on no error and that can't be done in C. In our ASM VDSO, fixed shifts are used, while in generic C VDSO, shifts are generic and read from the VDSO data. And there is still some funny code generated by GCC (8.1), like: 620: 7d 29 3c 30 srw r9,r9,r7 624: 21 87 00 20 subfic r12,r7,32 628: 7d 07 3c 31 srw. r7,r8,r7 62c: 7d 08 60 30 slw r8,r8,r12 630: 7d 0b 4b 78 or r11,r8,r9 634: 39 40 00 00 li r10,0 638: 40 82 00 84 bne 6bc <__c_kernel_clock_gettime+0x114> 63c: 81 23 00 24 lwz r9,36(r3) 640: 81 05 00 00 lwz r8,0(r5) ... 6bc: 7d 69 5b 78 mr r9,r11 6c0: 7c ea 3b 78 mr r10,r7 6c4: 7d 2b 4b 78 mr r11,r9 6c8: 4b ff ff 74 b 63c <__c_kernel_clock_gettime+0x94> This branch to 6bc is totally useless: - copying r11 into r9 is pointless as r9 is overwritten in 63c - copying back r9 into r11 is pointless as r11 has not been modified inbetween. - loading r10 with 0 then overwritting r10 with r7 when r7 is not 0 is pointless as well, could have directly put the result of srw. in r10. Christophe
On Mon, Jan 20, 2020 at 06:08:23PM +0100, Christophe Leroy wrote: > Not easy I think. > > First we have the unavoidable ASM entry function that can't be dropped > because of the CR[SO] bit the set on error or clear on no error and that > can't be done in C. Yup. > In our ASM VDSO, fixed shifts are used, while in generic C VDSO, shifts > are generic and read from the VDSO data. Does that cost more than just a few cycles? > And there is still some funny code generated by GCC (8.1), like: > > 620: 7d 29 3c 30 srw r9,r9,r7 > 624: 21 87 00 20 subfic r12,r7,32 > 628: 7d 07 3c 31 srw. r7,r8,r7 > 62c: 7d 08 60 30 slw r8,r8,r12 > 630: 7d 0b 4b 78 or r11,r8,r9 (This can be done cheaper for fixed shifts, you can use rlwimi then). > 634: 39 40 00 00 li r10,0 > 638: 40 82 00 84 bne 6bc <__c_kernel_clock_gettime+0x114> > 63c: 81 23 00 24 lwz r9,36(r3) > 640: 81 05 00 00 lwz r8,0(r5) > ... > 6bc: 7d 69 5b 78 mr r9,r11 > 6c0: 7c ea 3b 78 mr r10,r7 > 6c4: 7d 2b 4b 78 mr r11,r9 > 6c8: 4b ff ff 74 b 63c <__c_kernel_clock_gettime+0x94> > > This branch to 6bc is totally useless: > - copying r11 into r9 is pointless as r9 is overwritten in 63c > - copying back r9 into r11 is pointless as r11 has not been modified > inbetween. Yeah, huh, how did that happen. > - loading r10 with 0 then overwritting r10 with r7 when r7 is not 0 is > pointless as well, could have directly put the result of srw. in r10. This may be harder to make the compiler do. But the r9/r11 thing suggests you are preventing optimisation somewhere, maybe with some asm? Do you have some small testcase I can compile? Segher