Message ID | 20200320093545.28227-1-leo.yan@linaro.org (mailing list archive)
---|---
State | New, archived
Series | arm64: perf_event: Fix time_offset for arch timer
On Fri, Mar 20, 2020 at 05:35:45PM +0800, Leo Yan wrote:
> Between the system powering on and kernel's sched clock registration,
> the arch timer usually has been enabled at the early time and its
> counter is incremented during the period of the booting up. Thus the
> arch timer's counter is not completely accounted into the sched clock,
> and has a delta between the arch timer's counter and sched clock. This
> delta value should be stored into userpg->time_offset, which later can
> be retrieved by Perf tool in the user space for sample timestamp
> calculation.
>
> Now userpg->time_offset is assigned to the negative sched clock with
> '-now', this value cannot reflect the delta between arch timer's counter
> and sched clock, so Perf cannot use it to calculate the sample time.
>
> To fix this issue, this patch calculate the delta between the arch
> timer's and sched clock and assign the delta to userpg->time_offset.
> The detailed steps are firstly to convert counter to nanoseconds 'ns',
> then the offset is calculated as 'now' minus 'ns'.
>
>   |<------------------- 'ns' ---------------------->|
>                           |<-------- 'now' -------->|
>   |<---- time_offset ---->|
>   |-----------------------|--------------------------|
>   ^                       ^                          ^
>   Power on system         sched clock registration   Perf starts
>
> Signed-off-by: Leo Yan <leo.yan@linaro.org>
> ---
>  arch/arm64/kernel/perf_event.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)

Gentle ping ...

Hi Mike R., Peter,

If possible, could you give a look for this patch?

Thank you,
Leo

>
> diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
> index e40b65645c86..226d25d77072 100644
> --- a/arch/arm64/kernel/perf_event.c
> +++ b/arch/arm64/kernel/perf_event.c
> @@ -1143,6 +1143,7 @@ void arch_perf_update_userpage(struct perf_event *event,
>  {
>  	u32 freq;
>  	u32 shift;
> +	u64 count, ns, quot, rem;
>
>  	/*
>  	 * Internal timekeeping for enabled/running/stopped times
> @@ -1164,5 +1165,21 @@ void arch_perf_update_userpage(struct perf_event *event,
>  		userpg->time_mult >>= 1;
>  	}
>  	userpg->time_shift = (u16)shift;
> -	userpg->time_offset = -now;
> +
> +	/*
> +	 * Since arch timer is enabled ealier than sched clock registration,
> +	 * compuate the delta (in nanosecond unit) between the arch timer
> +	 * counter and sched clock, assign the delta to time_offset and
> +	 * perf tool can use it for timestamp calculation.
> +	 *
> +	 * The formula for conversion arch timer cycle to ns is:
> +	 *   quot = (cyc >> time_shift);
> +	 *   rem = cyc & ((1 << time_shift) - 1);
> +	 *   ns = quot * time_mult + ((rem * time_mult) >> time_shift);
> +	 */
> +	count = arch_timer_read_counter();
> +	quot = count >> shift;
> +	rem = count & ((1 << shift) - 1);
> +	ns = quot * userpg->time_mult + ((rem * userpg->time_mult) >> shift);
> +	userpg->time_offset = now - ns;
>  }
> --
> 2.17.1
>
Hi Leo,

[+Maz and tglx in case I'm barking up the wrong tree]

On Fri, Mar 20, 2020 at 05:35:45PM +0800, Leo Yan wrote:
> Between the system powering on and kernel's sched clock registration,
> the arch timer usually has been enabled at the early time and its
> counter is incremented during the period of the booting up. Thus the
> arch timer's counter is not completely accounted into the sched clock,
> and has a delta between the arch timer's counter and sched clock. This
> delta value should be stored into userpg->time_offset, which later can
> be retrieved by Perf tool in the user space for sample timestamp
> calculation.
>
> Now userpg->time_offset is assigned to the negative sched clock with
> '-now', this value cannot reflect the delta between arch timer's counter
> and sched clock, so Perf cannot use it to calculate the sample time.
>
> To fix this issue, this patch calculate the delta between the arch
> timer's and sched clock and assign the delta to userpg->time_offset.
> The detailed steps are firstly to convert counter to nanoseconds 'ns',
> then the offset is calculated as 'now' minus 'ns'.
>
>   |<------------------- 'ns' ---------------------->|
>                           |<-------- 'now' -------->|
>   |<---- time_offset ---->|
>   |-----------------------|--------------------------|
>   ^                       ^                          ^
>   Power on system         sched clock registration   Perf starts

FWIW, I'm /really/ struggling to understand the problem here.

If I've grokked it correctly (big 'if'), then you can't just factor in
what you call "time_offset" in the diagram above, because there isn't
a guarantee that the counter is zero-initialised at the start.

> Signed-off-by: Leo Yan <leo.yan@linaro.org>
> ---
>  arch/arm64/kernel/perf_event.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
> index e40b65645c86..226d25d77072 100644
> --- a/arch/arm64/kernel/perf_event.c
> +++ b/arch/arm64/kernel/perf_event.c
> @@ -1143,6 +1143,7 @@ void arch_perf_update_userpage(struct perf_event *event,
>  {
>  	u32 freq;
>  	u32 shift;
> +	u64 count, ns, quot, rem;
>
>  	/*
>  	 * Internal timekeeping for enabled/running/stopped times
> @@ -1164,5 +1165,21 @@ void arch_perf_update_userpage(struct perf_event *event,
>  		userpg->time_mult >>= 1;
>  	}
>  	userpg->time_shift = (u16)shift;
> -	userpg->time_offset = -now;
> +
> +	/*
> +	 * Since arch timer is enabled ealier than sched clock registration,
> +	 * compuate the delta (in nanosecond unit) between the arch timer
> +	 * counter and sched clock, assign the delta to time_offset and
> +	 * perf tool can use it for timestamp calculation.
> +	 *
> +	 * The formula for conversion arch timer cycle to ns is:
> +	 *   quot = (cyc >> time_shift);
> +	 *   rem = cyc & ((1 << time_shift) - 1);
> +	 *   ns = quot * time_mult + ((rem * time_mult) >> time_shift);
> +	 */
> +	count = arch_timer_read_counter();
> +	quot = count >> shift;
> +	rem = count & ((1 << shift) - 1);
> +	ns = quot * userpg->time_mult + ((rem * userpg->time_mult) >> shift);
> +	userpg->time_offset = now - ns;

Hmm, reading the counter and calculating the delta feels horribly
approximate to me. It would be much better if we could get hold of the
initial epoch cycles from the point at which sched_clock was initialised
using the counter. This represents the true cycle delta between the counter
and what sched_clock uses for 0 ns.

Unfortunately, I can't see a straightforward way to grab that information.
It looks like x86 pulls this directly from the TSC driver.

Will
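For reference, this is how perf tooling is expected to consume the
time_shift/time_mult/time_offset fields exported in the mmap'ed event
page, following the convention documented for struct perf_event_mmap_page.
A minimal sketch in C; the reading of the page and the seqlock around it
are omitted here, and the helper name is invented for illustration:

    #include <stdint.h>

    /*
     * Convert a raw counter value (e.g. from an SPE packet or a direct
     * counter read) into a sched_clock-style timestamp using the fields
     * from struct perf_event_mmap_page. With the patch above,
     * time_offset = now - ns(count), so adding it to ns(cyc) yields a
     * value on the sched_clock timeline.
     */
    static uint64_t cyc_to_perf_time(uint64_t cyc, uint16_t time_shift,
                                     uint32_t time_mult, uint64_t time_offset)
    {
            uint64_t quot = cyc >> time_shift;
            uint64_t rem  = cyc & (((uint64_t)1 << time_shift) - 1);

            return time_offset + quot * time_mult +
                   ((rem * time_mult) >> time_shift);
    }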
On 2020-04-30 15:58, Will Deacon wrote:
> Hi Leo,
>
> [+Maz and tglx in case I'm barking up the wrong tree]
>
> On Fri, Mar 20, 2020 at 05:35:45PM +0800, Leo Yan wrote:
>> Between the system powering on and kernel's sched clock registration,
>> the arch timer usually has been enabled at the early time and its
>> counter is incremented during the period of the booting up. Thus the
>> arch timer's counter is not completely accounted into the sched clock,
>> and has a delta between the arch timer's counter and sched clock.
>> This
>> delta value should be stored into userpg->time_offset, which later can
>> be retrieved by Perf tool in the user space for sample timestamp
>> calculation.
>>
>> Now userpg->time_offset is assigned to the negative sched clock with
>> '-now', this value cannot reflect the delta between arch timer's
>> counter
>> and sched clock, so Perf cannot use it to calculate the sample time.
>>
>> To fix this issue, this patch calculate the delta between the arch
>> timer's and sched clock and assign the delta to userpg->time_offset.
>> The detailed steps are firstly to convert counter to nanoseconds 'ns',
>> then the offset is calculated as 'now' minus 'ns'.
>>
>>   |<------------------- 'ns' ---------------------->|
>>                           |<-------- 'now' -------->|
>>   |<---- time_offset ---->|
>>   |-----------------------|--------------------------|
>>   ^                       ^                          ^
>>   Power on system         sched clock registration   Perf starts
>
> FWIW, I'm /really/ struggling to understand the problem here.
>
> If I've grokked it correctly (big 'if'), then you can't just factor in
> what you call "time_offset" in the diagram above, because there isn't
> a guarantee that the counter is zero-initialised at the start.

Even if it was, we have no idea of *when* that was. Think kexec, for a
start. Or spending some variable time in firmware because of $REASON.

>
>> Signed-off-by: Leo Yan <leo.yan@linaro.org>
>> ---
>>  arch/arm64/kernel/perf_event.c | 19 ++++++++++++++++++-
>>  1 file changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/kernel/perf_event.c
>> b/arch/arm64/kernel/perf_event.c
>> index e40b65645c86..226d25d77072 100644
>> --- a/arch/arm64/kernel/perf_event.c
>> +++ b/arch/arm64/kernel/perf_event.c
>> @@ -1143,6 +1143,7 @@ void arch_perf_update_userpage(struct perf_event
>> *event,
>>  {
>>  	u32 freq;
>>  	u32 shift;
>> +	u64 count, ns, quot, rem;
>>
>>  	/*
>>  	 * Internal timekeeping for enabled/running/stopped times
>> @@ -1164,5 +1165,21 @@ void arch_perf_update_userpage(struct
>> perf_event *event,
>>  		userpg->time_mult >>= 1;
>>  	}
>>  	userpg->time_shift = (u16)shift;
>> -	userpg->time_offset = -now;
>> +
>> +	/*
>> +	 * Since arch timer is enabled ealier than sched clock registration,
>> +	 * compuate the delta (in nanosecond unit) between the arch timer
>> +	 * counter and sched clock, assign the delta to time_offset and
>> +	 * perf tool can use it for timestamp calculation.
>> +	 *
>> +	 * The formula for conversion arch timer cycle to ns is:
>> +	 *   quot = (cyc >> time_shift);
>> +	 *   rem = cyc & ((1 << time_shift) - 1);
>> +	 *   ns = quot * time_mult + ((rem * time_mult) >> time_shift);
>> +	 */
>> +	count = arch_timer_read_counter();
>> +	quot = count >> shift;
>> +	rem = count & ((1 << shift) - 1);
>> +	ns = quot * userpg->time_mult + ((rem * userpg->time_mult) >>
>> shift);
>> +	userpg->time_offset = now - ns;
>
> Hmm, reading the counter and calculating the delta feels horribly
> approximate to me. It would be much better if we could get hold of the
> initial epoch cycles from the point at which sched_clock was
> initialised
> using the counter. This represents the true cycle delta between the
> counter
> and what sched_clock uses for 0 ns.

I think this is a sensible solution if you want an epoch that starts
at 0 with sched_clock being initialized. The other question is whether
it is possible to use a different timestamping source for perf that
wouldn't need to be offset.

> Unfortunately, I can't see a straightforward way to grab that
> information.
> It looks like x86 pulls this directly from the TSC driver.

I wonder if we could/should make __sched_clock_offset available even
when CONFIG_HAVE_UNSTABLE_SCHED_CLOCK isn't defined. It feels like it
would help with this particular can or worm...

        M.
On Thu, Apr 30, 2020 at 04:29:23PM +0100, Marc Zyngier wrote:

> I wonder if we could/should make __sched_clock_offset available even when
> CONFIG_HAVE_UNSTABLE_SCHED_CLOCK isn't defined. It feels like it would
> help with this particular can or worm...

Errrgh. __sched_clock_offset is only needed on x86 because we transition
from one clock device to another on boot. It really shouldn't exist on
anything sane.

Let me try and understand your particular problem better.
On Thu, Apr 30, 2020 at 06:04:36PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 30, 2020 at 04:29:23PM +0100, Marc Zyngier wrote:
>
> > I wonder if we could/should make __sched_clock_offset available even when
> > CONFIG_HAVE_UNSTABLE_SCHED_CLOCK isn't defined. It feels like it would
> > help with this particular can or worm...
>
> Errrgh. __sched_clock_offset is only needed on x86 because we transition
> from one clock device to another on boot. It really shouldn't exist on
> anything sane.

I think we still transition from jiffies on arm64, because we don't register
with sched_clock until the timer driver probes. Marc, is that right?

> Let me try and understand your particular problem better.

I think the long and short of it is that userspace needs a way to convert
the raw counter cycles into a ns value that can be compared against values
coming out of sched_clock. To do this accurately, I think it needs the
cycles value at the point when sched_clock was initialised.

Will
On Thu, Apr 30, 2020 at 03:58:24PM +0100, Will Deacon wrote:
> On Fri, Mar 20, 2020 at 05:35:45PM +0800, Leo Yan wrote:
> > @@ -1164,5 +1165,21 @@ void arch_perf_update_userpage(struct perf_event *event,
> >  		userpg->time_mult >>= 1;
> >  	}
> >  	userpg->time_shift = (u16)shift;
> > -	userpg->time_offset = -now;
> > +
> > +	/*
> > +	 * Since arch timer is enabled ealier than sched clock registration,
> > +	 * compuate the delta (in nanosecond unit) between the arch timer
> > +	 * counter and sched clock, assign the delta to time_offset and
> > +	 * perf tool can use it for timestamp calculation.
> > +	 *
> > +	 * The formula for conversion arch timer cycle to ns is:
> > +	 *   quot = (cyc >> time_shift);
> > +	 *   rem = cyc & ((1 << time_shift) - 1);
> > +	 *   ns = quot * time_mult + ((rem * time_mult) >> time_shift);
> > +	 */
> > +	count = arch_timer_read_counter();
> > +	quot = count >> shift;
> > +	rem = count & ((1 << shift) - 1);
> > +	ns = quot * userpg->time_mult + ((rem * userpg->time_mult) >> shift);
> > +	userpg->time_offset = now - ns;
>
> Hmm, reading the counter and calculating the delta feels horribly
> approximate to me. It would be much better if we could get hold of the
> initial epoch cycles from the point at which sched_clock was initialised
> using the counter. This represents the true cycle delta between the counter
> and what sched_clock uses for 0 ns.
>
> Unfortunately, I can't see a straightforward way to grab that information.
> It looks like x86 pulls this directly from the TSC driver.

Yeah, and I'm thinking you should do the same. IIRC ARM uses this
kernel/time/sched_clock.c thing, and if I read that right, the struct
clock_data there has all the bits you need here.

So I'm thinking that you might want to add a helper function here to get
you the good stuff.
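For illustration, here is one possible shape for the kind of helper Peter
is suggesting in kernel/time/sched_clock.c. The function name is invented,
and it assumes the file's internal 'struct clock_data cd' with its
latch-style read_data[] array and the epoch_cyc/epoch_ns fields; this is a
sketch of the idea, not an existing API:

    /*
     * Hypothetical helper: return the (cycles, ns) epoch pair that
     * sched_clock() is currently based on, so a caller such as
     * arch_perf_update_userpage() can derive an exact time_offset
     * instead of re-reading the counter and approximating it.
     */
    void sched_clock_get_epoch(u64 *epoch_cyc, u64 *epoch_ns)
    {
            struct clock_read_data *rd;
            unsigned int seq;

            do {
                    seq = raw_read_seqcount(&cd.seq);
                    rd = cd.read_data + (seq & 1);

                    *epoch_cyc = rd->epoch_cyc;
                    *epoch_ns  = rd->epoch_ns;
            } while (read_seqcount_retry(&cd.seq, seq));
    }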
On Thu, Apr 30, 2020 at 06:27:50PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 30, 2020 at 03:58:24PM +0100, Will Deacon wrote:
> > On Fri, Mar 20, 2020 at 05:35:45PM +0800, Leo Yan wrote:
> > > +	/*
> > > +	 * Since arch timer is enabled ealier than sched clock registration,
> > > +	 * compuate the delta (in nanosecond unit) between the arch timer
> > > +	 * counter and sched clock, assign the delta to time_offset and
> > > +	 * perf tool can use it for timestamp calculation.
> > > +	 *
> > > +	 * The formula for conversion arch timer cycle to ns is:
> > > +	 *   quot = (cyc >> time_shift);
> > > +	 *   rem = cyc & ((1 << time_shift) - 1);
> > > +	 *   ns = quot * time_mult + ((rem * time_mult) >> time_shift);
> > > +	 */
> > > +	count = arch_timer_read_counter();
> > > +	quot = count >> shift;
> > > +	rem = count & ((1 << shift) - 1);
> > > +	ns = quot * userpg->time_mult + ((rem * userpg->time_mult) >> shift);
> > > +	userpg->time_offset = now - ns;
> >
> > Hmm, reading the counter and calculating the delta feels horribly
> > approximate to me. It would be much better if we could get hold of the
> > initial epoch cycles from the point at which sched_clock was initialised
> > using the counter. This represents the true cycle delta between the counter
> > and what sched_clock uses for 0 ns.
> >
> > Unfortunately, I can't see a straightforward way to grab that information.
> > It looks like x86 pulls this directly from the TSC driver.
>
> Yeah, and I'm thinking you should do the same. IIRC ARM uses this
> kernel/time/sched_clock.c thing, and if I read that right, the struct
> clock_data there has all the bits you need here.
>
> So I'm thinking that you might want to add a helper function here to get
> you the good stuff.

Thanks, Peter. Leo -- do you think you could look at implementing this as
part of a v2, please?

Will
On 2020-04-30 17:18, Will Deacon wrote:
> On Thu, Apr 30, 2020 at 06:04:36PM +0200, Peter Zijlstra wrote:
>> On Thu, Apr 30, 2020 at 04:29:23PM +0100, Marc Zyngier wrote:
>>
>> > I wonder if we could/should make __sched_clock_offset available even when
>> > CONFIG_HAVE_UNSTABLE_SCHED_CLOCK isn't defined. It feels like it would
>> > help with this particular can or worm...
>>
>> Errrgh. __sched_clock_offset is only needed on x86 because we
>> transition
>> from one clock device to another on boot. It really shouldn't exist on
>> anything sane.
>
> I think we still transition from jiffies on arm64, because we don't
> register
> with sched_clock until the timer driver probes. Marc, is that right?

Indeed. The clocksource is only available relatively late, as we need
to discover the details of the platform and enable the various
workarounds (because nobody can get a simple 64bit counter right).
So it is only at that stage that we transition to it.

        M.
Hi all,

On Thu, Apr 30, 2020 at 05:18:15PM +0100, Will Deacon wrote:
> On Thu, Apr 30, 2020 at 06:04:36PM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 30, 2020 at 04:29:23PM +0100, Marc Zyngier wrote:
> >
> > > I wonder if we could/should make __sched_clock_offset available even when
> > > CONFIG_HAVE_UNSTABLE_SCHED_CLOCK isn't defined. It feels like it would
> > > help with this particular can or worm...
> >
> > Errrgh. __sched_clock_offset is only needed on x86 because we transition
> > from one clock device to another on boot. It really shouldn't exist on
> > anything sane.
>
> I think we still transition from jiffies on arm64, because we don't register
> with sched_clock until the timer driver probes. Marc, is that right?
>
> > Let me try and understand your particular problem better.
>
> I think the long and short of it is that userspace needs a way to convert
> the raw counter cycles into a ns value that can be compared against values
> coming out of sched_clock. To do this accurately, I think it needs the
> cycles value at the point when sched_clock was initialised.

Will's understanding is exactly what I want to resolve in this patch.

The background info is for the ARM SPE [1] decoding with perf tool, if
the timestamp is enabled, it uses the generic timer's counter as
timestamp source. SPE trace data only contains the raw counter cycles,
as Will mentioned, the perf tool needs to convert it to a coordinate
value with sched_clock. This is why this patch tries to calculate the
offset between the raw counter's ns value and sched_clock, eventually
this offset value will be used by SPE's decoding code in Perf tool to
calibrate a 'correct' timestamp.

Based on your suggestions, I will use __sched_clock_offset to resolve
the accuracy issue in patch v2. (I noticed Peter suggested to use a
new API for wrapping clock_data structure, IIUC, __sched_clock_offset
is more straightforward for this case).

Please correct if I miss anything. Thank you for reviewing and
suggestions!

Thanks,
Leo

[1] https://lkml.org/lkml/2020/1/23/571
On Fri, May 01, 2020 at 11:14:48PM +0800, Leo Yan wrote:
> On Thu, Apr 30, 2020 at 05:18:15PM +0100, Will Deacon wrote:
> > On Thu, Apr 30, 2020 at 06:04:36PM +0200, Peter Zijlstra wrote:
> > > On Thu, Apr 30, 2020 at 04:29:23PM +0100, Marc Zyngier wrote:
> > >
> > > > I wonder if we could/should make __sched_clock_offset available even when
> > > > CONFIG_HAVE_UNSTABLE_SCHED_CLOCK isn't defined. It feels like it would
> > > > help with this particular can or worm...
> > >
> > > Errrgh. __sched_clock_offset is only needed on x86 because we transition
> > > from one clock device to another on boot. It really shouldn't exist on
> > > anything sane.
> >
> > I think we still transition from jiffies on arm64, because we don't register
> > with sched_clock until the timer driver probes. Marc, is that right?
> >
> > > Let me try and understand your particular problem better.
> >
> > I think the long and short of it is that userspace needs a way to convert
> > the raw counter cycles into a ns value that can be compared against values
> > coming out of sched_clock. To do this accurately, I think it needs the
> > cycles value at the point when sched_clock was initialised.
>
> Will's understanding is exactly what I want to resolve in this patch.
>
> The background info is for the ARM SPE [1] decoding with perf tool, if
> the timestamp is enabled, it uses the generic timer's counter as
> timestamp source. SPE trace data only contains the raw counter cycles,
> as Will mentioned, the perf tool needs to convert it to a coordinate
> value with sched_clock. This is why this patch tries to calculate the
> offset between the raw counter's ns value and sched_clock, eventually
> this offset value will be used by SPE's decoding code in Perf tool to
> calibrate a 'correct' timestamp.
>
> Based on your suggestions, I will use __sched_clock_offset to resolve
> the accuracy issue in patch v2. (I noticed Peter suggested to use a
> new API for wrapping clock_data structure, IIUC, __sched_clock_offset
> is more straightforward for this case).
>
> Please correct if I miss anything. Thank you for reviewing and
> suggestions!

I don't think you can use __sched_clock_offset without selecting
HAVE_UNSTABLE_SCHED_CLOCK, and we really don't want to do that just
for this. So Peter's idea about exposing what we need is better, although
you'll probably need to take care with the switch-over from jiffies.

It needs some thought, but one possibility would be to introduce a new
variant of the sched_clock_register() function that returns the cycle
offset, and then we could fish that out of the timer driver. If we're
crossing all the 'i's and dotting all the 't's then we'd want to disable the
perf userpage if sched_clock changes clocksource too (a bit like we do for
the vDSO).

Will
On 2020-05-01 16:14, Leo Yan wrote:
> Hi all,
>
> On Thu, Apr 30, 2020 at 05:18:15PM +0100, Will Deacon wrote:
>> On Thu, Apr 30, 2020 at 06:04:36PM +0200, Peter Zijlstra wrote:
>> > On Thu, Apr 30, 2020 at 04:29:23PM +0100, Marc Zyngier wrote:
>> >
>> > > I wonder if we could/should make __sched_clock_offset available even when
>> > > CONFIG_HAVE_UNSTABLE_SCHED_CLOCK isn't defined. It feels like it would
>> > > help with this particular can or worm...
>> >
>> > Errrgh. __sched_clock_offset is only needed on x86 because we transition
>> > from one clock device to another on boot. It really shouldn't exist on
>> > anything sane.
>>
>> I think we still transition from jiffies on arm64, because we don't
>> register
>> with sched_clock until the timer driver probes. Marc, is that right?
>>
>> > Let me try and understand your particular problem better.
>>
>> I think the long and short of it is that userspace needs a way to
>> convert
>> the raw counter cycles into a ns value that can be compared against
>> values
>> coming out of sched_clock. To do this accurately, I think it needs the
>> cycles value at the point when sched_clock was initialised.
>
> Will's understanding is exactly what I want to resolve in this patch.
>
> The background info is for the ARM SPE [1] decoding with perf tool, if
> the timestamp is enabled, it uses the generic timer's counter as
> timestamp source. SPE trace data only contains the raw counter cycles,
> as Will mentioned, the perf tool needs to convert it to a coordinate
> value with sched_clock. This is why this patch tries to calculate the
> offset between the raw counter's ns value and sched_clock, eventually
> this offset value will be used by SPE's decoding code in Perf tool to
> calibrate a 'correct' timestamp.
>
> Based on your suggestions, I will use __sched_clock_offset to resolve
> the accuracy issue in patch v2. (I noticed Peter suggested to use a
> new API for wrapping clock_data structure, IIUC, __sched_clock_offset
> is more straightforward for this case).

I think Peter's point was *not* to use __sched_clock_offset, which
appears to be only there for the purpose of an x86 workaround (and is
not available to other architectures), but to instead expose the
relevant field (epoch_cyc) to the perf subsystem.

I feel it makes sense to make this a generic API, and see whether x86
can move over to it rather than the other way around.

Thanks,

        M.
On Fri, May 01, 2020 at 04:26:09PM +0100, Will Deacon wrote:

[...]

> > > > Let me try and understand your particular problem better.
> > >
> > > I think the long and short of it is that userspace needs a way to convert
> > > the raw counter cycles into a ns value that can be compared against values
> > > coming out of sched_clock. To do this accurately, I think it needs the
> > > cycles value at the point when sched_clock was initialised.
> >
> > Will's understanding is exactly what I want to resolve in this patch.
> >
> > The background info is for the ARM SPE [1] decoding with perf tool, if
> > the timestamp is enabled, it uses the generic timer's counter as
> > timestamp source. SPE trace data only contains the raw counter cycles,
> > as Will mentioned, the perf tool needs to convert it to a coordinate
> > value with sched_clock. This is why this patch tries to calculate the
> > offset between the raw counter's ns value and sched_clock, eventually
> > this offset value will be used by SPE's decoding code in Perf tool to
> > calibrate a 'correct' timestamp.
> >
> > Based on your suggestions, I will use __sched_clock_offset to resolve
> > the accuracy issue in patch v2. (I noticed Peter suggested to use a
> > new API for wrapping clock_data structure, IIUC, __sched_clock_offset
> > is more straightforward for this case).
> >
> > Please correct if I miss anything. Thank you for reviewing and
> > suggestions!
>
> I don't think you can use __sched_clock_offset without selecting
> HAVE_UNSTABLE_SCHED_CLOCK, and we really don't want to do that just
> for this. So Peter's idea about exposing what we need is better, although
> you'll probably need to take care with the switch-over from jiffies.
>
> It needs some thought, but one possibility would be to introduce a new
> variant of the sched_clock_register() function that returns the cycle
> offset, and then we could fish that out of the timer driver.

Thanks a lot to you and Marc for the correction.

> If we're
> crossing all the 'i's and dotting all the 't's then we'd want to disable the
> perf userpage if sched_clock changes clocksource too (a bit like we do for
> the vDSO).

To be honest, one thing that is not clear to me is how the perf tool can
update the arch timer's parameters in the middle of tracing after the
perf userpage is disabled and re-enabled. I will make a note of this and
look into the detailed implementation for this part.

Thanks for sharing comprehensive thoughts!

Leo
On Sat, May 02, 2020 at 12:10:50AM +0800, Leo Yan wrote:
> On Fri, May 01, 2020 at 04:26:09PM +0100, Will Deacon wrote:
>
> [...]
>
> > > > > Let me try and understand your particular problem better.
> > > >
> > > > I think the long and short of it is that userspace needs a way to convert
> > > > the raw counter cycles into a ns value that can be compared against values
> > > > coming out of sched_clock. To do this accurately, I think it needs the
> > > > cycles value at the point when sched_clock was initialised.
> > >
> > > Will's understanding is exactly what I want to resolve in this patch.
> > >
> > > The background info is for the ARM SPE [1] decoding with perf tool, if
> > > the timestamp is enabled, it uses the generic timer's counter as
> > > timestamp source. SPE trace data only contains the raw counter cycles,
> > > as Will mentioned, the perf tool needs to convert it to a coordinate
> > > value with sched_clock. This is why this patch tries to calculate the
> > > offset between the raw counter's ns value and sched_clock, eventually
> > > this offset value will be used by SPE's decoding code in Perf tool to
> > > calibrate a 'correct' timestamp.
> > >
> > > Based on your suggestions, I will use __sched_clock_offset to resolve
> > > the accuracy issue in patch v2. (I noticed Peter suggested to use a
> > > new API for wrapping clock_data structure, IIUC, __sched_clock_offset
> > > is more straightforward for this case).
> > >
> > > Please correct if I miss anything. Thank you for reviewing and
> > > suggestions!
> >
> > I don't think you can use __sched_clock_offset without selecting
> > HAVE_UNSTABLE_SCHED_CLOCK, and we really don't want to do that just
> > for this. So Peter's idea about exposing what we need is better, although
> > you'll probably need to take care with the switch-over from jiffies.
> >
> > It needs some thought, but one possibility would be to introduce a new
> > variant of the sched_clock_register() function that returns the cycle
> > offset, and then we could fish that out of the timer driver.
>
> Thanks a lot to you and Marc for the correction.
>
> > If we're
> > crossing all the 'i's and dotting all the 't's then we'd want to disable the
> > perf userpage if sched_clock changes clocksource too (a bit like we do for
> > the vDSO).
>
> To be honest, one thing that is not clear to me is how the perf tool can
> update the arch timer's parameters in the middle of tracing after the
> perf userpage is disabled and re-enabled. I will make a note of this and
> look into the detailed implementation for this part.

I don't fully understand the concern but, generally, the seqlock should
take care of any inconsistencies in the data page.

Will
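For reference, the seqlock Will refers to is the documented userspace
read pattern against struct perf_event_mmap_page: the time_* fields are
only consumed if pc->lock did not change across the reads, so a
concurrent update by arch_perf_update_userpage() is simply retried. A
minimal sketch, with the barrier macro standing in for a compiler
barrier:

    #include <stdint.h>
    #include <linux/perf_event.h>

    #define barrier() __asm__ __volatile__("" ::: "memory")

    /* Take a consistent snapshot of the time conversion fields. */
    static void read_userpage_time(volatile struct perf_event_mmap_page *pc,
                                   uint16_t *shift, uint32_t *mult,
                                   uint64_t *offset)
    {
            uint32_t seq;

            do {
                    seq = pc->lock;
                    barrier();

                    *shift  = pc->time_shift;
                    *mult   = pc->time_mult;
                    *offset = pc->time_offset;

                    barrier();
            } while (pc->lock != seq);
    }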
diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index e40b65645c86..226d25d77072 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -1143,6 +1143,7 @@ void arch_perf_update_userpage(struct perf_event *event,
 {
 	u32 freq;
 	u32 shift;
+	u64 count, ns, quot, rem;
 
 	/*
 	 * Internal timekeeping for enabled/running/stopped times
@@ -1164,5 +1165,21 @@ void arch_perf_update_userpage(struct perf_event *event,
 		userpg->time_mult >>= 1;
 	}
 	userpg->time_shift = (u16)shift;
-	userpg->time_offset = -now;
+
+	/*
+	 * Since arch timer is enabled ealier than sched clock registration,
+	 * compuate the delta (in nanosecond unit) between the arch timer
+	 * counter and sched clock, assign the delta to time_offset and
+	 * perf tool can use it for timestamp calculation.
+	 *
+	 * The formula for conversion arch timer cycle to ns is:
+	 *   quot = (cyc >> time_shift);
+	 *   rem = cyc & ((1 << time_shift) - 1);
+	 *   ns = quot * time_mult + ((rem * time_mult) >> time_shift);
+	 */
+	count = arch_timer_read_counter();
+	quot = count >> shift;
+	rem = count & ((1 << shift) - 1);
+	ns = quot * userpg->time_mult + ((rem * userpg->time_mult) >> shift);
+	userpg->time_offset = now - ns;
 }
Between the system powering on and the kernel's sched clock registration,
the arch timer has usually already been enabled and its counter is
incremented while the system boots. Thus the arch timer's counter is not
completely accounted into the sched clock, and there is a delta between
the arch timer's counter and the sched clock. This delta value should be
stored into userpg->time_offset, which later can be retrieved by the Perf
tool in the user space for sample timestamp calculation.

Currently userpg->time_offset is assigned the negated sched clock value
'-now'; this value cannot reflect the delta between the arch timer's
counter and the sched clock, so Perf cannot use it to calculate the
sample time.

To fix this issue, this patch calculates the delta between the arch
timer's counter and the sched clock and assigns the delta to
userpg->time_offset. The detailed steps are firstly to convert the
counter to nanoseconds 'ns', then the offset is calculated as 'now'
minus 'ns'.

  |<------------------- 'ns' ---------------------->|
                          |<-------- 'now' -------->|
  |<---- time_offset ---->|
  |-----------------------|--------------------------|
  ^                       ^                          ^
  Power on system         sched clock registration   Perf starts

Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 arch/arm64/kernel/perf_event.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)
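To make the offset semantics concrete, a purely illustrative example
(numbers invented, and time_mult/time_shift assumed to convert cycles to
nanoseconds exactly): suppose the arch timer runs at 50 MHz and 3 s
elapse between power-on and sched_clock registration. When
arch_perf_update_userpage() runs at sched_clock time now = 2 s, the
counter reads 250,000,000 cycles, which converts to ns = 5 s, so
time_offset = now - ns = -3 s. A later raw counter timestamp converting
to 7.5 s then maps to 7.5 s + time_offset = 4.5 s on the sched_clock
timeline. Note this picture assumes the counter starts from zero at
power-on, which is exactly the assumption Will and Marc question above.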