Message ID | 20200512124450.824507755@infradead.org
---|---
State | New, archived |
Series | arm64: perf: Proper cap_user_time* support
Hi Peter,

On Tue, May 12, 2020 at 02:41:00PM +0200, Peter Zijlstra wrote:
> As reported by Leo; the existing implementation is broken when the
> clock and counter don't intersect at 0.
>
> Use the sched_clock's struct clock_read_data information to correctly
> implement cap_user_time and cap_user_time_zero.
>
> Note that the ARM64 counter is architecturally only guaranteed to be
> 56bit wide (implementations are allowed to be wider) and the existing
> perf ABI cannot deal with wrap-around.
>
> This implementation should also be faster than the old; seeing how we
> don't need to recompute mult and shift all the time.
>
> Reported-by: Leo Yan <leo.yan@linaro.org>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  arch/arm64/kernel/perf_event.c | 36 +++++++++++++++++++++++++++---------
>  1 file changed, 27 insertions(+), 9 deletions(-)
>
> --- a/arch/arm64/kernel/perf_event.c
> +++ b/arch/arm64/kernel/perf_event.c
> @@ -19,6 +19,7 @@
>  #include <linux/of.h>
>  #include <linux/perf/arm_pmu.h>
>  #include <linux/platform_device.h>
> +#include <linux/sched_clock.h>
>  #include <linux/smp.h>
>
>  /* ARMv8 Cortex-A53 specific event types. */
> @@ -1165,28 +1166,45 @@ device_initcall(armv8_pmu_driver_init)
>  void arch_perf_update_userpage(struct perf_event *event,
>  			       struct perf_event_mmap_page *userpg, u64 now)
>  {
> -	u32 freq;
> -	u32 shift;
> +	struct clock_read_data *rd;
> +	unsigned int seq;
>
>  	/*
>  	 * Internal timekeeping for enabled/running/stopped times
>  	 * is always computed with the sched_clock.
>  	 */
> -	freq = arch_timer_get_rate();
>  	userpg->cap_user_time = 1;
> +	userpg->cap_user_time_zero = 1;
> +
> +	do {
> +		rd = sched_clock_read_begin(&seq);
> +
> +		userpg->time_mult = rd->mult;
> +		userpg->time_shift = rd->shift;
> +		userpg->time_zero = rd->epoch_ns;
> +
> +		/*
> +		 * This isn't strictly correct, the ARM64 counter can be
> +		 * 'short' and then we get funnies when it wraps. The correct
> +		 * thing would be to extend the perf ABI with a cycle and mask
> +		 * value, but because wrapping on ARM64 is very rare in
> +		 * practise this 'works'.
> +		 */
> +		userpg->time_zero -= (rd->epoch_cyc * rd->mult) >> rd->shift;
> +
> +	} while (sched_clock_read_retry(seq));
> +
> +	userpg->time_offset = userpg->time_zero - now;
>
> -	clocks_calc_mult_shift(&userpg->time_mult, &shift, freq,
> -			NSEC_PER_SEC, 0);
>  	/*
>  	 * time_shift is not expected to be greater than 31 due to
>  	 * the original published conversion algorithm shifting a
>  	 * 32-bit value (now specifies a 64-bit value) - refer
>  	 * perf_event_mmap_page documentation in perf_event.h.
>  	 */
> -	if (shift == 32) {
> -		shift = 31;
> +	if (userpg->shift == 32) {

Thanks a lot for the patch set, some typos:

s/shift/time_shift

> +		userpg->shift = 31;

s/shift/time_shift

Thanks,
Leo

>  		userpg->time_mult >>= 1;
>  	}
> -	userpg->time_shift = (u16)shift;
> -	userpg->time_offset = -now;
> +
>  }
>
>
On Tue, May 12, 2020 at 10:03:01PM +0800, Leo Yan wrote:

> > +	if (userpg->shift == 32) {
>
> Thanks a lot for the patch set, some typos:
>
> s/shift/time_shift
>
> > +		userpg->shift = 31;
>
> s/shift/time_shift

Blergh.. so much for me not taking the time to dig out the arm64 cross
compiler :/

Sorry about that.
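[Editorial note: for readers following along, the rename Leo points out maps
onto the perf_event_mmap_page fields that do exist (time_shift, time_mult).
Folded back into the quoted hunk, the fix would read as below; this is our
reconstruction of the obvious correction, untested and not a patch posted in
this thread.]

-	if (userpg->shift == 32) {
-		userpg->shift = 31;
+	if (userpg->time_shift == 32) {
+		userpg->time_shift = 31;
 		userpg->time_mult >>= 1;
 	}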
Hi Peter,

I love your patch! Yet something to improve:

[auto build test ERROR on tip/perf/core]
[also build test ERROR on arm64/for-next/core arm-perf/for-next/perf linus/master v5.7-rc5 next-20200512]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Peter-Zijlstra/arm64-perf-Proper-cap_user_time-support/20200512-205141
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git 059c6d68cfc5f85ba3ab71d71a6de380016f7936
config: arm64-allyesconfig (attached as .config)
compiler: aarch64-linux-gcc (GCC) 9.3.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day GCC_VERSION=9.3.0 make.cross ARCH=arm64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   arch/arm64/kernel/perf_event.c: In function 'arch_perf_update_userpage':
>> arch/arm64/kernel/perf_event.c:1205:12: error: 'struct perf_event_mmap_page' has no member named 'shift'
    1205 |  if (userpg->shift == 32) {
         |            ^~
   arch/arm64/kernel/perf_event.c:1206:9: error: 'struct perf_event_mmap_page' has no member named 'shift'
    1206 |   userpg->shift = 31;
         |         ^~

vim +1205 arch/arm64/kernel/perf_event.c

  1165	
  1166	void arch_perf_update_userpage(struct perf_event *event,
  1167				       struct perf_event_mmap_page *userpg, u64 now)
  1168	{
  1169		struct clock_read_data *rd;
  1170		unsigned int seq;
  1171	
  1172		/*
  1173		 * Internal timekeeping for enabled/running/stopped times
  1174		 * is always computed with the sched_clock.
  1175		 */
  1176		userpg->cap_user_time = 1;
  1177		userpg->cap_user_time_zero = 1;
  1178	
  1179		do {
  1180			rd = sched_clock_read_begin(&seq);
  1181	
  1182			userpg->time_mult = rd->mult;
  1183			userpg->time_shift = rd->shift;
  1184			userpg->time_zero = rd->epoch_ns;
  1185	
  1186			/*
  1187			 * This isn't strictly correct, the ARM64 counter can be
  1188			 * 'short' and then we get funnies when it wraps. The correct
  1189			 * thing would be to extend the perf ABI with a cycle and mask
  1190			 * value, but because wrapping on ARM64 is very rare in
  1191			 * practise this 'works'.
  1192			 */
  1193			userpg->time_zero -= (rd->epoch_cyc * rd->mult) >> rd->shift;
  1194	
  1195		} while (sched_clock_read_retry(seq));
  1196	
  1197		userpg->time_offset = userpg->time_zero - now;
  1198	
  1199		/*
  1200		 * time_shift is not expected to be greater than 31 due to
  1201		 * the original published conversion algorithm shifting a
  1202		 * 32-bit value (now specifies a 64-bit value) - refer
  1203		 * perf_event_mmap_page documentation in perf_event.h.
  1204		 */
> 1205		if (userpg->shift == 32) {
  1206			userpg->shift = 31;
  1207			userpg->time_mult >>= 1;
  1208		}
  1209	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -19,6 +19,7 @@
 #include <linux/of.h>
 #include <linux/perf/arm_pmu.h>
 #include <linux/platform_device.h>
+#include <linux/sched_clock.h>
 #include <linux/smp.h>
 
 /* ARMv8 Cortex-A53 specific event types. */
@@ -1165,28 +1166,45 @@ device_initcall(armv8_pmu_driver_init)
 void arch_perf_update_userpage(struct perf_event *event,
 			       struct perf_event_mmap_page *userpg, u64 now)
 {
-	u32 freq;
-	u32 shift;
+	struct clock_read_data *rd;
+	unsigned int seq;
 
 	/*
 	 * Internal timekeeping for enabled/running/stopped times
 	 * is always computed with the sched_clock.
 	 */
-	freq = arch_timer_get_rate();
 	userpg->cap_user_time = 1;
+	userpg->cap_user_time_zero = 1;
+
+	do {
+		rd = sched_clock_read_begin(&seq);
+
+		userpg->time_mult = rd->mult;
+		userpg->time_shift = rd->shift;
+		userpg->time_zero = rd->epoch_ns;
+
+		/*
+		 * This isn't strictly correct, the ARM64 counter can be
+		 * 'short' and then we get funnies when it wraps. The correct
+		 * thing would be to extend the perf ABI with a cycle and mask
+		 * value, but because wrapping on ARM64 is very rare in
+		 * practise this 'works'.
+		 */
+		userpg->time_zero -= (rd->epoch_cyc * rd->mult) >> rd->shift;
+
+	} while (sched_clock_read_retry(seq));
+
+	userpg->time_offset = userpg->time_zero - now;
 
-	clocks_calc_mult_shift(&userpg->time_mult, &shift, freq,
-			NSEC_PER_SEC, 0);
 	/*
 	 * time_shift is not expected to be greater than 31 due to
 	 * the original published conversion algorithm shifting a
 	 * 32-bit value (now specifies a 64-bit value) - refer
 	 * perf_event_mmap_page documentation in perf_event.h.
 	 */
-	if (shift == 32) {
-		shift = 31;
+	if (userpg->shift == 32) {
+		userpg->shift = 31;
 		userpg->time_mult >>= 1;
 	}
-	userpg->time_shift = (u16)shift;
-	userpg->time_offset = -now;
+
 }
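[Editorial note: as context for the fields the patch fills in, here is how a
user-space consumer is expected to use them: read time_zero/time_mult/time_shift
from the mmapped perf_event_mmap_page and convert a raw counter value to
sched_clock() nanoseconds. This example is ours, not part of the thread; it is
arm64-only, assumes CNTVCT_EL0 is readable from EL0, and skips the pc->lock
seqcount re-read that a real consumer would perform.]

#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Convert a raw counter value to nanoseconds using the userpage fields,
 * following the split-multiply form documented in
 * include/uapi/linux/perf_event.h to avoid 64-bit overflow.
 */
static uint64_t cyc_to_ns(const struct perf_event_mmap_page *pc, uint64_t cyc)
{
	uint64_t quot = cyc >> pc->time_shift;
	uint64_t rem  = cyc & (((uint64_t)1 << pc->time_shift) - 1);

	return pc->time_zero + quot * pc->time_mult +
	       ((rem * pc->time_mult) >> pc->time_shift);
}

int main(void)
{
	struct perf_event_attr attr;
	struct perf_event_mmap_page *pc;
	uint64_t cyc;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_DUMMY;	/* any event exposes the userpage */

	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	/* One page is enough: we only want the metadata page, not a ring buffer. */
	pc = mmap(NULL, sysconf(_SC_PAGESIZE), PROT_READ, MAP_SHARED, fd, 0);
	if (pc == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	if (!pc->cap_user_time || !pc->cap_user_time_zero) {
		fprintf(stderr, "cap_user_time{,_zero} not advertised\n");
		return 1;
	}

	/* Raw arm64 virtual counter read; a real consumer re-reads under pc->lock. */
	asm volatile("mrs %0, cntvct_el0" : "=r" (cyc));

	printf("counter %llu -> %llu ns\n",
	       (unsigned long long)cyc,
	       (unsigned long long)cyc_to_ns(pc, cyc));
	return 0;
}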
As reported by Leo; the existing implementation is broken when the
clock and counter don't intersect at 0.

Use the sched_clock's struct clock_read_data information to correctly
implement cap_user_time and cap_user_time_zero.

Note that the ARM64 counter is architecturally only guaranteed to be
56bit wide (implementations are allowed to be wider) and the existing
perf ABI cannot deal with wrap-around.

This implementation should also be faster than the old; seeing how we
don't need to recompute mult and shift all the time.

Reported-by: Leo Yan <leo.yan@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/arm64/kernel/perf_event.c | 36 +++++++++++++++++++++++++++---------
 1 file changed, 27 insertions(+), 9 deletions(-)
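[Editorial note: to spell out the arithmetic behind the time_zero subtraction
in the patch (our paraphrase, not text from the thread): sched_clock() computes,
per struct clock_read_data and ignoring sched_clock_mask, which is where the
'short counter' caveat in the patch comment comes from,

	ns = epoch_ns + (((cyc - epoch_cyc) * mult) >> shift)

while the perf userpage ABI expects

	time = time_zero + ((cyc * time_mult) >> time_shift)

so with time_mult = mult and time_shift = shift, the two agree (up to rounding
from distributing the shift) when

	time_zero = epoch_ns - ((epoch_cyc * mult) >> shift)

which is exactly the subtraction inside the seqcount loop; time_offset =
time_zero - now then expresses the same clock relative to the event's 'now'
timestamp.]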