Message ID | 20210303212800.43787-1-umesh.nerlige.ramappa@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | i915/query: Correlate engine and cpu timestamps with better accuracy | expand |
Quoting Umesh Nerlige Ramappa (2021-03-03 21:28:00) > Perf measurements rely on CPU and engine timestamps to correlate > events of interest across these time domains. Current mechanisms get > these timestamps separately and the calculated delta between these > timestamps lack enough accuracy. > > To improve the accuracy of these time measurements to within a few us, > add a query that returns the engine and cpu timestamps captured as > close to each other as possible. > > v2: (Tvrtko) > - document clock reference used > - return cpu timestamp always > - capture cpu time just before lower dword of cs timestamp > > v3: (Chris) > - use uncore-rpm > - use __query_cs_timestamp helper > > v4: (Lionel) > - Kernel perf subsytem allows users to specify the clock id to be used > in perf_event_open. This clock id is used by the perf subsystem to > return the appropriate cpu timestamp in perf events. Similarly, let > the user pass the clockid to this query so that cpu timestamp > corresponds to the clock id requested. > > v5: (Tvrtko) > - Use normal ktime accessors instead of fast versions > - Add more uApi documentation > > v6: (Lionel) > - Move switch out of spinlock > > v7: (Chris) > - cs_timestamp is a misnomer, use cs_cycles instead > - return the cs cycle frequency as well in the query > > v8: > - Add platform and engine specific checks > > v9: (Lionel) > - Return 2 cpu timestamps in the query - captured before and after the > register read > > Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> > --- > drivers/gpu/drm/i915/i915_query.c | 144 ++++++++++++++++++++++++++++++ > include/uapi/drm/i915_drm.h | 47 ++++++++++ > 2 files changed, 191 insertions(+) > > diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c > index fed337ad7b68..acca22ee6014 100644 > --- a/drivers/gpu/drm/i915/i915_query.c > +++ b/drivers/gpu/drm/i915/i915_query.c > @@ -6,6 +6,8 @@ > > #include <linux/nospec.h> > > +#include "gt/intel_engine_pm.h" > +#include "gt/intel_engine_user.h" > #include "i915_drv.h" > #include "i915_perf.h" > #include "i915_query.h" > @@ -90,6 +92,147 @@ static int query_topology_info(struct drm_i915_private *dev_priv, > return total_length; > } > > +typedef u64 (*__ktime_func_t)(void); > +static __ktime_func_t __clock_id_to_func(clockid_t clk_id) > +{ > + /* > + * Use logic same as the perf subsystem to allow user to select the > + * reference clock id to be used for timestamps. > + */ > + switch (clk_id) { > + case CLOCK_MONOTONIC: > + return &ktime_get_ns; > + case CLOCK_MONOTONIC_RAW: > + return &ktime_get_raw_ns; > + case CLOCK_REALTIME: > + return &ktime_get_real_ns; > + case CLOCK_BOOTTIME: > + return &ktime_get_boottime_ns; > + case CLOCK_TAI: > + return &ktime_get_clocktai_ns; > + default: > + return NULL; > + } > +} > + > +static inline int > +__read_timestamps(struct intel_uncore *uncore, > + i915_reg_t lower_reg, > + i915_reg_t upper_reg, > + u64 *cs_ts, > + u64 *cpu_ts, > + __ktime_func_t cpu_clock) > +{ > + u32 upper, lower, old_upper, loop = 0; > + > + upper = intel_uncore_read_fw(uncore, upper_reg); > + do { > + cpu_ts[0] = cpu_clock(); > + lower = intel_uncore_read_fw(uncore, lower_reg); > + cpu_ts[1] = cpu_clock(); > + old_upper = upper; > + upper = intel_uncore_read_fw(uncore, upper_reg); Both register reads comprise the timestamp returned to userspace, so presumably you want cpu_ts[] to wrap both. do { old_upper = upper; cpu_ts[0] = cpu_clock(); lower = intel_uncore_read_fw(uncore, lower_reg); upper = intel_uncore_read_fw(uncore, upper_reg); cpu_ts[1] = cpu_clock(); } while (upper != old_upper && loop++ < 2); --------------------------------------------------------------------- Intel Corporation (UK) Limited Registered No. 1134945 (England) Registered Office: Pipers Way, Swindon SN3 1RJ VAT No: 860 2173 47 This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
On 04/03/2021 02:09, Chris Wilson wrote: > Quoting Umesh Nerlige Ramappa (2021-03-03 21:28:00) >> Perf measurements rely on CPU and engine timestamps to correlate >> events of interest across these time domains. Current mechanisms get >> these timestamps separately and the calculated delta between these >> timestamps lack enough accuracy. >> >> To improve the accuracy of these time measurements to within a few us, >> add a query that returns the engine and cpu timestamps captured as >> close to each other as possible. >> >> v2: (Tvrtko) >> - document clock reference used >> - return cpu timestamp always >> - capture cpu time just before lower dword of cs timestamp >> >> v3: (Chris) >> - use uncore-rpm >> - use __query_cs_timestamp helper >> >> v4: (Lionel) >> - Kernel perf subsytem allows users to specify the clock id to be used >> in perf_event_open. This clock id is used by the perf subsystem to >> return the appropriate cpu timestamp in perf events. Similarly, let >> the user pass the clockid to this query so that cpu timestamp >> corresponds to the clock id requested. >> >> v5: (Tvrtko) >> - Use normal ktime accessors instead of fast versions >> - Add more uApi documentation >> >> v6: (Lionel) >> - Move switch out of spinlock >> >> v7: (Chris) >> - cs_timestamp is a misnomer, use cs_cycles instead >> - return the cs cycle frequency as well in the query >> >> v8: >> - Add platform and engine specific checks >> >> v9: (Lionel) >> - Return 2 cpu timestamps in the query - captured before and after the >> register read >> >> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> >> --- >> drivers/gpu/drm/i915/i915_query.c | 144 ++++++++++++++++++++++++++++++ >> include/uapi/drm/i915_drm.h | 47 ++++++++++ >> 2 files changed, 191 insertions(+) >> >> diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c >> index fed337ad7b68..acca22ee6014 100644 >> --- a/drivers/gpu/drm/i915/i915_query.c >> +++ b/drivers/gpu/drm/i915/i915_query.c >> @@ -6,6 +6,8 @@ >> >> #include <linux/nospec.h> >> >> +#include "gt/intel_engine_pm.h" >> +#include "gt/intel_engine_user.h" >> #include "i915_drv.h" >> #include "i915_perf.h" >> #include "i915_query.h" >> @@ -90,6 +92,147 @@ static int query_topology_info(struct drm_i915_private *dev_priv, >> return total_length; >> } >> >> +typedef u64 (*__ktime_func_t)(void); >> +static __ktime_func_t __clock_id_to_func(clockid_t clk_id) >> +{ >> + /* >> + * Use logic same as the perf subsystem to allow user to select the >> + * reference clock id to be used for timestamps. >> + */ >> + switch (clk_id) { >> + case CLOCK_MONOTONIC: >> + return &ktime_get_ns; >> + case CLOCK_MONOTONIC_RAW: >> + return &ktime_get_raw_ns; >> + case CLOCK_REALTIME: >> + return &ktime_get_real_ns; >> + case CLOCK_BOOTTIME: >> + return &ktime_get_boottime_ns; >> + case CLOCK_TAI: >> + return &ktime_get_clocktai_ns; >> + default: >> + return NULL; >> + } >> +} >> + >> +static inline int >> +__read_timestamps(struct intel_uncore *uncore, >> + i915_reg_t lower_reg, >> + i915_reg_t upper_reg, >> + u64 *cs_ts, >> + u64 *cpu_ts, >> + __ktime_func_t cpu_clock) >> +{ >> + u32 upper, lower, old_upper, loop = 0; >> + >> + upper = intel_uncore_read_fw(uncore, upper_reg); >> + do { >> + cpu_ts[0] = cpu_clock(); >> + lower = intel_uncore_read_fw(uncore, lower_reg); >> + cpu_ts[1] = cpu_clock(); >> + old_upper = upper; >> + upper = intel_uncore_read_fw(uncore, upper_reg); > Both register reads comprise the timestamp returned to userspace, so > presumably you want cpu_ts[] to wrap both. > > do { > old_upper = upper; > > cpu_ts[0] = cpu_clock(); > lower = intel_uncore_read_fw(uncore, lower_reg); > upper = intel_uncore_read_fw(uncore, upper_reg); > cpu_ts[1] = cpu_clock(); > } while (upper != old_upper && loop++ < 2); Actually if we want the best accuracy we can just deal with the lower dword. We can check the upper one hasn't changed outside of the 2 cpu_clock() calls. -Lionel
On 03/03/2021 23:28, Umesh Nerlige Ramappa wrote: > Perf measurements rely on CPU and engine timestamps to correlate > events of interest across these time domains. Current mechanisms get > these timestamps separately and the calculated delta between these > timestamps lack enough accuracy. > > To improve the accuracy of these time measurements to within a few us, > add a query that returns the engine and cpu timestamps captured as > close to each other as possible. > > v2: (Tvrtko) > - document clock reference used > - return cpu timestamp always > - capture cpu time just before lower dword of cs timestamp > > v3: (Chris) > - use uncore-rpm > - use __query_cs_timestamp helper > > v4: (Lionel) > - Kernel perf subsytem allows users to specify the clock id to be used > in perf_event_open. This clock id is used by the perf subsystem to > return the appropriate cpu timestamp in perf events. Similarly, let > the user pass the clockid to this query so that cpu timestamp > corresponds to the clock id requested. > > v5: (Tvrtko) > - Use normal ktime accessors instead of fast versions > - Add more uApi documentation > > v6: (Lionel) > - Move switch out of spinlock > > v7: (Chris) > - cs_timestamp is a misnomer, use cs_cycles instead > - return the cs cycle frequency as well in the query > > v8: > - Add platform and engine specific checks > > v9: (Lionel) > - Return 2 cpu timestamps in the query - captured before and after the > register read > > Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> Looks good to me, thanks! Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> Let me prepare a Mesa MR to use this with VK_EXT_calibrated_timestamps. -Lionel > --- > drivers/gpu/drm/i915/i915_query.c | 144 ++++++++++++++++++++++++++++++ > include/uapi/drm/i915_drm.h | 47 ++++++++++ > 2 files changed, 191 insertions(+) > > diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c > index fed337ad7b68..acca22ee6014 100644 > --- a/drivers/gpu/drm/i915/i915_query.c > +++ b/drivers/gpu/drm/i915/i915_query.c > @@ -6,6 +6,8 @@ > > #include <linux/nospec.h> > > +#include "gt/intel_engine_pm.h" > +#include "gt/intel_engine_user.h" > #include "i915_drv.h" > #include "i915_perf.h" > #include "i915_query.h" > @@ -90,6 +92,147 @@ static int query_topology_info(struct drm_i915_private *dev_priv, > return total_length; > } > > +typedef u64 (*__ktime_func_t)(void); > +static __ktime_func_t __clock_id_to_func(clockid_t clk_id) > +{ > + /* > + * Use logic same as the perf subsystem to allow user to select the > + * reference clock id to be used for timestamps. > + */ > + switch (clk_id) { > + case CLOCK_MONOTONIC: > + return &ktime_get_ns; > + case CLOCK_MONOTONIC_RAW: > + return &ktime_get_raw_ns; > + case CLOCK_REALTIME: > + return &ktime_get_real_ns; > + case CLOCK_BOOTTIME: > + return &ktime_get_boottime_ns; > + case CLOCK_TAI: > + return &ktime_get_clocktai_ns; > + default: > + return NULL; > + } > +} > + > +static inline int > +__read_timestamps(struct intel_uncore *uncore, > + i915_reg_t lower_reg, > + i915_reg_t upper_reg, > + u64 *cs_ts, > + u64 *cpu_ts, > + __ktime_func_t cpu_clock) > +{ > + u32 upper, lower, old_upper, loop = 0; > + > + upper = intel_uncore_read_fw(uncore, upper_reg); > + do { > + cpu_ts[0] = cpu_clock(); > + lower = intel_uncore_read_fw(uncore, lower_reg); > + cpu_ts[1] = cpu_clock(); > + old_upper = upper; > + upper = intel_uncore_read_fw(uncore, upper_reg); > + } while (upper != old_upper && loop++ < 2); > + > + *cs_ts = (u64)upper << 32 | lower; > + > + return 0; > +} > + > +static int > +__query_cs_cycles(struct intel_engine_cs *engine, > + u64 *cs_ts, u64 *cpu_ts, > + __ktime_func_t cpu_clock) > +{ > + struct intel_uncore *uncore = engine->uncore; > + enum forcewake_domains fw_domains; > + u32 base = engine->mmio_base; > + intel_wakeref_t wakeref; > + int ret; > + > + fw_domains = intel_uncore_forcewake_for_reg(uncore, > + RING_TIMESTAMP(base), > + FW_REG_READ); > + > + with_intel_runtime_pm(uncore->rpm, wakeref) { > + spin_lock_irq(&uncore->lock); > + intel_uncore_forcewake_get__locked(uncore, fw_domains); > + > + ret = __read_timestamps(uncore, > + RING_TIMESTAMP(base), > + RING_TIMESTAMP_UDW(base), > + cs_ts, > + cpu_ts, > + cpu_clock); > + > + intel_uncore_forcewake_put__locked(uncore, fw_domains); > + spin_unlock_irq(&uncore->lock); > + } > + > + return ret; > +} > + > +static int > +query_cs_cycles(struct drm_i915_private *i915, > + struct drm_i915_query_item *query_item) > +{ > + struct drm_i915_query_cs_cycles __user *query_ptr; > + struct drm_i915_query_cs_cycles query; > + struct intel_engine_cs *engine; > + __ktime_func_t cpu_clock; > + int ret; > + > + if (INTEL_GEN(i915) < 6) > + return -ENODEV; > + > + query_ptr = u64_to_user_ptr(query_item->data_ptr); > + ret = copy_query_item(&query, sizeof(query), sizeof(query), query_item); > + if (ret != 0) > + return ret; > + > + if (query.flags) > + return -EINVAL; > + > + if (query.rsvd) > + return -EINVAL; > + > + cpu_clock = __clock_id_to_func(query.clockid); > + if (!cpu_clock) > + return -EINVAL; > + > + engine = intel_engine_lookup_user(i915, > + query.engine.engine_class, > + query.engine.engine_instance); > + if (!engine) > + return -EINVAL; > + > + if (IS_GEN(i915, 6) && > + query.engine.engine_class != I915_ENGINE_CLASS_RENDER) > + return -ENODEV; > + > + query.cs_frequency = engine->gt->clock_frequency; > + ret = __query_cs_cycles(engine, > + &query.cs_cycles, > + query.cpu_timestamp, > + cpu_clock); > + if (ret) > + return ret; > + > + if (put_user(query.cs_frequency, &query_ptr->cs_frequency)) > + return -EFAULT; > + > + if (put_user(query.cpu_timestamp[0], &query_ptr->cpu_timestamp[0])) > + return -EFAULT; > + > + if (put_user(query.cpu_timestamp[1], &query_ptr->cpu_timestamp[1])) > + return -EFAULT; > + > + if (put_user(query.cs_cycles, &query_ptr->cs_cycles)) > + return -EFAULT; > + > + return sizeof(query); > +} > + > static int > query_engine_info(struct drm_i915_private *i915, > struct drm_i915_query_item *query_item) > @@ -424,6 +567,7 @@ static int (* const i915_query_funcs[])(struct drm_i915_private *dev_priv, > query_topology_info, > query_engine_info, > query_perf_config, > + query_cs_cycles, > }; > > int i915_query_ioctl(struct drm_device *dev, void *data, struct drm_file *file) > diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h > index 1987e2ea79a3..abcc479e2be1 100644 > --- a/include/uapi/drm/i915_drm.h > +++ b/include/uapi/drm/i915_drm.h > @@ -2176,6 +2176,10 @@ struct drm_i915_query_item { > #define DRM_I915_QUERY_TOPOLOGY_INFO 1 > #define DRM_I915_QUERY_ENGINE_INFO 2 > #define DRM_I915_QUERY_PERF_CONFIG 3 > + /** > + * Query Command Streamer timestamp register. > + */ > +#define DRM_I915_QUERY_CS_CYCLES 4 > /* Must be kept compact -- no holes and well documented */ > > /* > @@ -2309,6 +2313,49 @@ struct drm_i915_engine_info { > __u64 rsvd1[4]; > }; > > +/** > + * struct drm_i915_query_cs_cycles > + * > + * The query returns the command streamer cycles and the frequency that can be > + * used to calculate the command streamer timestamp. In addition the query > + * returns the cpu timestamp that indicates when the command streamer cycle > + * count was captured. > + */ > +struct drm_i915_query_cs_cycles { > + /** Engine for which command streamer cycles is queried. */ > + struct i915_engine_class_instance engine; > + > + /** Must be zero. */ > + __u32 flags; > + > + /** > + * Command streamer cycles as read from the command streamer > + * register at 0x358 offset. > + */ > + __u64 cs_cycles; > + > + /** Frequency of the cs cycles in Hz. */ > + __u64 cs_frequency; > + > + /** > + * CPU timestamp in nanoseconds. cpu_timestamp[0] is captured before > + * reading the cs_cycles register and cpu_timestamp[1] is captured after > + * reading the register. > + **/ > + __u64 cpu_timestamp[2]; > + > + /** > + * Reference clock id for CPU timestamp. For definition, see > + * clock_gettime(2) and perf_event_open(2). Supported clock ids are > + * CLOCK_MONOTONIC, CLOCK_MONOTONIC_RAW, CLOCK_REALTIME, CLOCK_BOOTTIME, > + * CLOCK_TAI. > + */ > + __s32 clockid; > + > + /** Must be zero. */ > + __u32 rsvd; > +}; > + > /** > * struct drm_i915_query_engine_info > *
Quoting Lionel Landwerlin (2021-03-04 08:28:59) > On 04/03/2021 02:09, Chris Wilson wrote: > > Quoting Umesh Nerlige Ramappa (2021-03-03 21:28:00) > >> Perf measurements rely on CPU and engine timestamps to correlate > >> events of interest across these time domains. Current mechanisms get > >> these timestamps separately and the calculated delta between these > >> timestamps lack enough accuracy. > >> > >> To improve the accuracy of these time measurements to within a few us, > >> add a query that returns the engine and cpu timestamps captured as > >> close to each other as possible. > >> > >> v2: (Tvrtko) > >> - document clock reference used > >> - return cpu timestamp always > >> - capture cpu time just before lower dword of cs timestamp > >> > >> v3: (Chris) > >> - use uncore-rpm > >> - use __query_cs_timestamp helper > >> > >> v4: (Lionel) > >> - Kernel perf subsytem allows users to specify the clock id to be used > >> in perf_event_open. This clock id is used by the perf subsystem to > >> return the appropriate cpu timestamp in perf events. Similarly, let > >> the user pass the clockid to this query so that cpu timestamp > >> corresponds to the clock id requested. > >> > >> v5: (Tvrtko) > >> - Use normal ktime accessors instead of fast versions > >> - Add more uApi documentation > >> > >> v6: (Lionel) > >> - Move switch out of spinlock > >> > >> v7: (Chris) > >> - cs_timestamp is a misnomer, use cs_cycles instead > >> - return the cs cycle frequency as well in the query > >> > >> v8: > >> - Add platform and engine specific checks > >> > >> v9: (Lionel) > >> - Return 2 cpu timestamps in the query - captured before and after the > >> register read > >> > >> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> > >> --- > >> drivers/gpu/drm/i915/i915_query.c | 144 ++++++++++++++++++++++++++++++ > >> include/uapi/drm/i915_drm.h | 47 ++++++++++ > >> 2 files changed, 191 insertions(+) > >> > >> diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c > >> index fed337ad7b68..acca22ee6014 100644 > >> --- a/drivers/gpu/drm/i915/i915_query.c > >> +++ b/drivers/gpu/drm/i915/i915_query.c > >> @@ -6,6 +6,8 @@ > >> > >> #include <linux/nospec.h> > >> > >> +#include "gt/intel_engine_pm.h" > >> +#include "gt/intel_engine_user.h" > >> #include "i915_drv.h" > >> #include "i915_perf.h" > >> #include "i915_query.h" > >> @@ -90,6 +92,147 @@ static int query_topology_info(struct drm_i915_private *dev_priv, > >> return total_length; > >> } > >> > >> +typedef u64 (*__ktime_func_t)(void); > >> +static __ktime_func_t __clock_id_to_func(clockid_t clk_id) > >> +{ > >> + /* > >> + * Use logic same as the perf subsystem to allow user to select the > >> + * reference clock id to be used for timestamps. > >> + */ > >> + switch (clk_id) { > >> + case CLOCK_MONOTONIC: > >> + return &ktime_get_ns; > >> + case CLOCK_MONOTONIC_RAW: > >> + return &ktime_get_raw_ns; > >> + case CLOCK_REALTIME: > >> + return &ktime_get_real_ns; > >> + case CLOCK_BOOTTIME: > >> + return &ktime_get_boottime_ns; > >> + case CLOCK_TAI: > >> + return &ktime_get_clocktai_ns; > >> + default: > >> + return NULL; > >> + } > >> +} > >> + > >> +static inline int > >> +__read_timestamps(struct intel_uncore *uncore, > >> + i915_reg_t lower_reg, > >> + i915_reg_t upper_reg, > >> + u64 *cs_ts, > >> + u64 *cpu_ts, > >> + __ktime_func_t cpu_clock) > >> +{ > >> + u32 upper, lower, old_upper, loop = 0; > >> + > >> + upper = intel_uncore_read_fw(uncore, upper_reg); > >> + do { > >> + cpu_ts[0] = cpu_clock(); > >> + lower = intel_uncore_read_fw(uncore, lower_reg); > >> + cpu_ts[1] = cpu_clock(); > >> + old_upper = upper; > >> + upper = intel_uncore_read_fw(uncore, upper_reg); > > Both register reads comprise the timestamp returned to userspace, so > > presumably you want cpu_ts[] to wrap both. > > > > do { > > old_upper = upper; > > > > cpu_ts[0] = cpu_clock(); > > lower = intel_uncore_read_fw(uncore, lower_reg); > > upper = intel_uncore_read_fw(uncore, upper_reg); > > cpu_ts[1] = cpu_clock(); > > } while (upper != old_upper && loop++ < 2); > > Actually if we want the best accuracy we can just deal with the lower dword. Accuracy of what? The lower dword read perhaps, or the accuracy of the sample point for the combined reads for the timestamp, which is closer to an external observer (cpu_clock() implies reference to an external observer). The two clock samples are not even necessarily closely related due to the nmi adjustments. If you wanted an unadjusted elapsed time for the read you can use local_clock() then return the chosen cpu_clock() before plus the elapsed delta from around the read as the estimated error. cpu_ts[1] = local_clock(); cpu_ts[0] = cpu_clock(); lower = intel_uncore_read_fw(uncore, lower_reg); cpu_ts[1] = local_clock() - cpu_ts[1]; -Chris
On 04/03/2021 10:58, Chris Wilson wrote: > Quoting Lionel Landwerlin (2021-03-04 08:28:59) >> On 04/03/2021 02:09, Chris Wilson wrote: >>> Quoting Umesh Nerlige Ramappa (2021-03-03 21:28:00) >>>> Perf measurements rely on CPU and engine timestamps to correlate >>>> events of interest across these time domains. Current mechanisms get >>>> these timestamps separately and the calculated delta between these >>>> timestamps lack enough accuracy. >>>> >>>> To improve the accuracy of these time measurements to within a few us, >>>> add a query that returns the engine and cpu timestamps captured as >>>> close to each other as possible. >>>> >>>> v2: (Tvrtko) >>>> - document clock reference used >>>> - return cpu timestamp always >>>> - capture cpu time just before lower dword of cs timestamp >>>> >>>> v3: (Chris) >>>> - use uncore-rpm >>>> - use __query_cs_timestamp helper >>>> >>>> v4: (Lionel) >>>> - Kernel perf subsytem allows users to specify the clock id to be used >>>> in perf_event_open. This clock id is used by the perf subsystem to >>>> return the appropriate cpu timestamp in perf events. Similarly, let >>>> the user pass the clockid to this query so that cpu timestamp >>>> corresponds to the clock id requested. >>>> >>>> v5: (Tvrtko) >>>> - Use normal ktime accessors instead of fast versions >>>> - Add more uApi documentation >>>> >>>> v6: (Lionel) >>>> - Move switch out of spinlock >>>> >>>> v7: (Chris) >>>> - cs_timestamp is a misnomer, use cs_cycles instead >>>> - return the cs cycle frequency as well in the query >>>> >>>> v8: >>>> - Add platform and engine specific checks >>>> >>>> v9: (Lionel) >>>> - Return 2 cpu timestamps in the query - captured before and after the >>>> register read >>>> >>>> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> >>>> --- >>>> drivers/gpu/drm/i915/i915_query.c | 144 ++++++++++++++++++++++++++++++ >>>> include/uapi/drm/i915_drm.h | 47 ++++++++++ >>>> 2 files changed, 191 insertions(+) >>>> >>>> diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c >>>> index fed337ad7b68..acca22ee6014 100644 >>>> --- a/drivers/gpu/drm/i915/i915_query.c >>>> +++ b/drivers/gpu/drm/i915/i915_query.c >>>> @@ -6,6 +6,8 @@ >>>> >>>> #include <linux/nospec.h> >>>> >>>> +#include "gt/intel_engine_pm.h" >>>> +#include "gt/intel_engine_user.h" >>>> #include "i915_drv.h" >>>> #include "i915_perf.h" >>>> #include "i915_query.h" >>>> @@ -90,6 +92,147 @@ static int query_topology_info(struct drm_i915_private *dev_priv, >>>> return total_length; >>>> } >>>> >>>> +typedef u64 (*__ktime_func_t)(void); >>>> +static __ktime_func_t __clock_id_to_func(clockid_t clk_id) >>>> +{ >>>> + /* >>>> + * Use logic same as the perf subsystem to allow user to select the >>>> + * reference clock id to be used for timestamps. >>>> + */ >>>> + switch (clk_id) { >>>> + case CLOCK_MONOTONIC: >>>> + return &ktime_get_ns; >>>> + case CLOCK_MONOTONIC_RAW: >>>> + return &ktime_get_raw_ns; >>>> + case CLOCK_REALTIME: >>>> + return &ktime_get_real_ns; >>>> + case CLOCK_BOOTTIME: >>>> + return &ktime_get_boottime_ns; >>>> + case CLOCK_TAI: >>>> + return &ktime_get_clocktai_ns; >>>> + default: >>>> + return NULL; >>>> + } >>>> +} >>>> + >>>> +static inline int >>>> +__read_timestamps(struct intel_uncore *uncore, >>>> + i915_reg_t lower_reg, >>>> + i915_reg_t upper_reg, >>>> + u64 *cs_ts, >>>> + u64 *cpu_ts, >>>> + __ktime_func_t cpu_clock) >>>> +{ >>>> + u32 upper, lower, old_upper, loop = 0; >>>> + >>>> + upper = intel_uncore_read_fw(uncore, upper_reg); >>>> + do { >>>> + cpu_ts[0] = cpu_clock(); >>>> + lower = intel_uncore_read_fw(uncore, lower_reg); >>>> + cpu_ts[1] = cpu_clock(); >>>> + old_upper = upper; >>>> + upper = intel_uncore_read_fw(uncore, upper_reg); >>> Both register reads comprise the timestamp returned to userspace, so >>> presumably you want cpu_ts[] to wrap both. >>> >>> do { >>> old_upper = upper; >>> >>> cpu_ts[0] = cpu_clock(); >>> lower = intel_uncore_read_fw(uncore, lower_reg); >>> upper = intel_uncore_read_fw(uncore, upper_reg); >>> cpu_ts[1] = cpu_clock(); >>> } while (upper != old_upper && loop++ < 2); >> Actually if we want the best accuracy we can just deal with the lower dword. > Accuracy of what? The lower dword read perhaps, or the accuracy of the > sample point for the combined reads for the timestamp, which is closer > to an external observer (cpu_clock() implies reference to an external > observer). > > The two clock samples are not even necessarily closely related due to the > nmi adjustments. If you wanted an unadjusted elapsed time for the read > you can use local_clock() then return the chosen cpu_clock() before plus > the elapsed delta from around the read as the estimated error. > > cpu_ts[1] = local_clock(); > cpu_ts[0] = cpu_clock(); > lower = intel_uncore_read_fw(uncore, lower_reg); > cpu_ts[1] = local_clock() - cpu_ts[1]; > -Chris Thanks, I meant the accuracy of having 2 samples GPU/CPU as close as possible. Avoiding to account another register read in there is nice. My testing was also mostly done with CLOCK_MONOTONIC_RAW which doesn't seem to be adjusted like CLOCK_MONOTONIC so maybe that why I didn't see the issue. -Lionel
Quoting Lionel Landwerlin (2021-03-04 09:45:47) > On 04/03/2021 10:58, Chris Wilson wrote: > > Quoting Lionel Landwerlin (2021-03-04 08:28:59) > >> On 04/03/2021 02:09, Chris Wilson wrote: > >>> Quoting Umesh Nerlige Ramappa (2021-03-03 21:28:00) > >>>> Perf measurements rely on CPU and engine timestamps to correlate > >>>> events of interest across these time domains. Current mechanisms get > >>>> these timestamps separately and the calculated delta between these > >>>> timestamps lack enough accuracy. > >>>> > >>>> To improve the accuracy of these time measurements to within a few us, > >>>> add a query that returns the engine and cpu timestamps captured as > >>>> close to each other as possible. > >>>> > >>>> v2: (Tvrtko) > >>>> - document clock reference used > >>>> - return cpu timestamp always > >>>> - capture cpu time just before lower dword of cs timestamp > >>>> > >>>> v3: (Chris) > >>>> - use uncore-rpm > >>>> - use __query_cs_timestamp helper > >>>> > >>>> v4: (Lionel) > >>>> - Kernel perf subsytem allows users to specify the clock id to be used > >>>> in perf_event_open. This clock id is used by the perf subsystem to > >>>> return the appropriate cpu timestamp in perf events. Similarly, let > >>>> the user pass the clockid to this query so that cpu timestamp > >>>> corresponds to the clock id requested. > >>>> > >>>> v5: (Tvrtko) > >>>> - Use normal ktime accessors instead of fast versions > >>>> - Add more uApi documentation > >>>> > >>>> v6: (Lionel) > >>>> - Move switch out of spinlock > >>>> > >>>> v7: (Chris) > >>>> - cs_timestamp is a misnomer, use cs_cycles instead > >>>> - return the cs cycle frequency as well in the query > >>>> > >>>> v8: > >>>> - Add platform and engine specific checks > >>>> > >>>> v9: (Lionel) > >>>> - Return 2 cpu timestamps in the query - captured before and after the > >>>> register read > >>>> > >>>> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> > >>>> --- > >>>> drivers/gpu/drm/i915/i915_query.c | 144 ++++++++++++++++++++++++++++++ > >>>> include/uapi/drm/i915_drm.h | 47 ++++++++++ > >>>> 2 files changed, 191 insertions(+) > >>>> > >>>> diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c > >>>> index fed337ad7b68..acca22ee6014 100644 > >>>> --- a/drivers/gpu/drm/i915/i915_query.c > >>>> +++ b/drivers/gpu/drm/i915/i915_query.c > >>>> @@ -6,6 +6,8 @@ > >>>> > >>>> #include <linux/nospec.h> > >>>> > >>>> +#include "gt/intel_engine_pm.h" > >>>> +#include "gt/intel_engine_user.h" > >>>> #include "i915_drv.h" > >>>> #include "i915_perf.h" > >>>> #include "i915_query.h" > >>>> @@ -90,6 +92,147 @@ static int query_topology_info(struct drm_i915_private *dev_priv, > >>>> return total_length; > >>>> } > >>>> > >>>> +typedef u64 (*__ktime_func_t)(void); > >>>> +static __ktime_func_t __clock_id_to_func(clockid_t clk_id) > >>>> +{ > >>>> + /* > >>>> + * Use logic same as the perf subsystem to allow user to select the > >>>> + * reference clock id to be used for timestamps. > >>>> + */ > >>>> + switch (clk_id) { > >>>> + case CLOCK_MONOTONIC: > >>>> + return &ktime_get_ns; > >>>> + case CLOCK_MONOTONIC_RAW: > >>>> + return &ktime_get_raw_ns; > >>>> + case CLOCK_REALTIME: > >>>> + return &ktime_get_real_ns; > >>>> + case CLOCK_BOOTTIME: > >>>> + return &ktime_get_boottime_ns; > >>>> + case CLOCK_TAI: > >>>> + return &ktime_get_clocktai_ns; > >>>> + default: > >>>> + return NULL; > >>>> + } > >>>> +} > >>>> + > >>>> +static inline int > >>>> +__read_timestamps(struct intel_uncore *uncore, > >>>> + i915_reg_t lower_reg, > >>>> + i915_reg_t upper_reg, > >>>> + u64 *cs_ts, > >>>> + u64 *cpu_ts, > >>>> + __ktime_func_t cpu_clock) > >>>> +{ > >>>> + u32 upper, lower, old_upper, loop = 0; > >>>> + > >>>> + upper = intel_uncore_read_fw(uncore, upper_reg); > >>>> + do { > >>>> + cpu_ts[0] = cpu_clock(); > >>>> + lower = intel_uncore_read_fw(uncore, lower_reg); > >>>> + cpu_ts[1] = cpu_clock(); > >>>> + old_upper = upper; > >>>> + upper = intel_uncore_read_fw(uncore, upper_reg); > >>> Both register reads comprise the timestamp returned to userspace, so > >>> presumably you want cpu_ts[] to wrap both. > >>> > >>> do { > >>> old_upper = upper; > >>> > >>> cpu_ts[0] = cpu_clock(); > >>> lower = intel_uncore_read_fw(uncore, lower_reg); > >>> upper = intel_uncore_read_fw(uncore, upper_reg); > >>> cpu_ts[1] = cpu_clock(); > >>> } while (upper != old_upper && loop++ < 2); > >> Actually if we want the best accuracy we can just deal with the lower dword. > > Accuracy of what? The lower dword read perhaps, or the accuracy of the > > sample point for the combined reads for the timestamp, which is closer > > to an external observer (cpu_clock() implies reference to an external > > observer). > > > > The two clock samples are not even necessarily closely related due to the > > nmi adjustments. If you wanted an unadjusted elapsed time for the read > > you can use local_clock() then return the chosen cpu_clock() before plus > > the elapsed delta from around the read as the estimated error. > > > > cpu_ts[1] = local_clock(); > > cpu_ts[0] = cpu_clock(); > > lower = intel_uncore_read_fw(uncore, lower_reg); > > cpu_ts[1] = local_clock() - cpu_ts[1]; > > -Chris > > Thanks, > > > I meant the accuracy of having 2 samples GPU/CPU as close as possible. > > Avoiding to account another register read in there is nice. > > > My testing was also mostly done with CLOCK_MONOTONIC_RAW which doesn't > seem to be adjusted like CLOCK_MONOTONIC so maybe that why I didn't see > the issue. _RAW is still adjusted for skews, just not coupled into the ntp feedback. That is less obvious than the other clocks, and why it's preferred for comparing against other HW sources. But two reads of _RAW are only monotonic, not necessarily on the same time base. local_clock() is tsc/arat, so counting the CPU cycles between the two reads with the frequency (at least on x86) held constant (and arat should be frequency invariant). If we want much better accuracy, we are supposed to use cyclecounter_t and the system_device_crosststamp. -Chris
On 04/03/2021 11:54, Chris Wilson wrote: >>>> Actually if we want the best accuracy we can just deal with the lower dword. >>> Accuracy of what? The lower dword read perhaps, or the accuracy of the >>> sample point for the combined reads for the timestamp, which is closer >>> to an external observer (cpu_clock() implies reference to an external >>> observer). >>> >>> The two clock samples are not even necessarily closely related due to the >>> nmi adjustments. If you wanted an unadjusted elapsed time for the read >>> you can use local_clock() then return the chosen cpu_clock() before plus >>> the elapsed delta from around the read as the estimated error. >>> >>> cpu_ts[1] = local_clock(); >>> cpu_ts[0] = cpu_clock(); >>> lower = intel_uncore_read_fw(uncore, lower_reg); >>> cpu_ts[1] = local_clock() - cpu_ts[1]; >>> -Chris >> Thanks, >> >> >> I meant the accuracy of having 2 samples GPU/CPU as close as possible. >> >> Avoiding to account another register read in there is nice. >> >> >> My testing was also mostly done with CLOCK_MONOTONIC_RAW which doesn't >> seem to be adjusted like CLOCK_MONOTONIC so maybe that why I didn't see >> the issue. > _RAW is still adjusted for skews, just not coupled into the ntp feedback. > That is less obvious than the other clocks, and why it's preferred for > comparing against other HW sources. But two reads of _RAW are only > monotonic, not necessarily on the same time base. local_clock() is > tsc/arat, so counting the CPU cycles between the two reads with the > frequency (at least on x86) held constant (and arat should be frequency > invariant). > > If we want much better accuracy, we are supposed to use cyclecounter_t > and the system_device_crosststamp. > -Chris Thanks for the pointers. I think people are mostly trying to map what's coming out of OA or queries from the various command streamers back to perf/ftrace. As far I know perf will only let you select a clockid. So maybe cyclecounter_t is not that useful atm. -Lionel
On 03/03/2021 23:28, Umesh Nerlige Ramappa wrote: > Perf measurements rely on CPU and engine timestamps to correlate > events of interest across these time domains. Current mechanisms get > these timestamps separately and the calculated delta between these > timestamps lack enough accuracy. > > To improve the accuracy of these time measurements to within a few us, > add a query that returns the engine and cpu timestamps captured as > close to each other as possible. > > v2: (Tvrtko) > - document clock reference used > - return cpu timestamp always > - capture cpu time just before lower dword of cs timestamp > > v3: (Chris) > - use uncore-rpm > - use __query_cs_timestamp helper > > v4: (Lionel) > - Kernel perf subsytem allows users to specify the clock id to be used > in perf_event_open. This clock id is used by the perf subsystem to > return the appropriate cpu timestamp in perf events. Similarly, let > the user pass the clockid to this query so that cpu timestamp > corresponds to the clock id requested. > > v5: (Tvrtko) > - Use normal ktime accessors instead of fast versions > - Add more uApi documentation > > v6: (Lionel) > - Move switch out of spinlock > > v7: (Chris) > - cs_timestamp is a misnomer, use cs_cycles instead > - return the cs cycle frequency as well in the query > > v8: > - Add platform and engine specific checks > > v9: (Lionel) > - Return 2 cpu timestamps in the query - captured before and after the > register read > > Signed-off-by: Umesh Nerlige Ramappa<umesh.nerlige.ramappa@intel.com> > --- > drivers/gpu/drm/i915/i915_query.c | 144 ++++++++++++++++++++++++++++++ FYI, the MR for Mesa : https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/9407 -Lionel
diff --git a/drivers/gpu/drm/i915/i915_query.c b/drivers/gpu/drm/i915/i915_query.c index fed337ad7b68..acca22ee6014 100644 --- a/drivers/gpu/drm/i915/i915_query.c +++ b/drivers/gpu/drm/i915/i915_query.c @@ -6,6 +6,8 @@ #include <linux/nospec.h> +#include "gt/intel_engine_pm.h" +#include "gt/intel_engine_user.h" #include "i915_drv.h" #include "i915_perf.h" #include "i915_query.h" @@ -90,6 +92,147 @@ static int query_topology_info(struct drm_i915_private *dev_priv, return total_length; } +typedef u64 (*__ktime_func_t)(void); +static __ktime_func_t __clock_id_to_func(clockid_t clk_id) +{ + /* + * Use logic same as the perf subsystem to allow user to select the + * reference clock id to be used for timestamps. + */ + switch (clk_id) { + case CLOCK_MONOTONIC: + return &ktime_get_ns; + case CLOCK_MONOTONIC_RAW: + return &ktime_get_raw_ns; + case CLOCK_REALTIME: + return &ktime_get_real_ns; + case CLOCK_BOOTTIME: + return &ktime_get_boottime_ns; + case CLOCK_TAI: + return &ktime_get_clocktai_ns; + default: + return NULL; + } +} + +static inline int +__read_timestamps(struct intel_uncore *uncore, + i915_reg_t lower_reg, + i915_reg_t upper_reg, + u64 *cs_ts, + u64 *cpu_ts, + __ktime_func_t cpu_clock) +{ + u32 upper, lower, old_upper, loop = 0; + + upper = intel_uncore_read_fw(uncore, upper_reg); + do { + cpu_ts[0] = cpu_clock(); + lower = intel_uncore_read_fw(uncore, lower_reg); + cpu_ts[1] = cpu_clock(); + old_upper = upper; + upper = intel_uncore_read_fw(uncore, upper_reg); + } while (upper != old_upper && loop++ < 2); + + *cs_ts = (u64)upper << 32 | lower; + + return 0; +} + +static int +__query_cs_cycles(struct intel_engine_cs *engine, + u64 *cs_ts, u64 *cpu_ts, + __ktime_func_t cpu_clock) +{ + struct intel_uncore *uncore = engine->uncore; + enum forcewake_domains fw_domains; + u32 base = engine->mmio_base; + intel_wakeref_t wakeref; + int ret; + + fw_domains = intel_uncore_forcewake_for_reg(uncore, + RING_TIMESTAMP(base), + FW_REG_READ); + + with_intel_runtime_pm(uncore->rpm, wakeref) { + spin_lock_irq(&uncore->lock); + intel_uncore_forcewake_get__locked(uncore, fw_domains); + + ret = __read_timestamps(uncore, + RING_TIMESTAMP(base), + RING_TIMESTAMP_UDW(base), + cs_ts, + cpu_ts, + cpu_clock); + + intel_uncore_forcewake_put__locked(uncore, fw_domains); + spin_unlock_irq(&uncore->lock); + } + + return ret; +} + +static int +query_cs_cycles(struct drm_i915_private *i915, + struct drm_i915_query_item *query_item) +{ + struct drm_i915_query_cs_cycles __user *query_ptr; + struct drm_i915_query_cs_cycles query; + struct intel_engine_cs *engine; + __ktime_func_t cpu_clock; + int ret; + + if (INTEL_GEN(i915) < 6) + return -ENODEV; + + query_ptr = u64_to_user_ptr(query_item->data_ptr); + ret = copy_query_item(&query, sizeof(query), sizeof(query), query_item); + if (ret != 0) + return ret; + + if (query.flags) + return -EINVAL; + + if (query.rsvd) + return -EINVAL; + + cpu_clock = __clock_id_to_func(query.clockid); + if (!cpu_clock) + return -EINVAL; + + engine = intel_engine_lookup_user(i915, + query.engine.engine_class, + query.engine.engine_instance); + if (!engine) + return -EINVAL; + + if (IS_GEN(i915, 6) && + query.engine.engine_class != I915_ENGINE_CLASS_RENDER) + return -ENODEV; + + query.cs_frequency = engine->gt->clock_frequency; + ret = __query_cs_cycles(engine, + &query.cs_cycles, + query.cpu_timestamp, + cpu_clock); + if (ret) + return ret; + + if (put_user(query.cs_frequency, &query_ptr->cs_frequency)) + return -EFAULT; + + if (put_user(query.cpu_timestamp[0], &query_ptr->cpu_timestamp[0])) + return -EFAULT; + + if (put_user(query.cpu_timestamp[1], &query_ptr->cpu_timestamp[1])) + return -EFAULT; + + if (put_user(query.cs_cycles, &query_ptr->cs_cycles)) + return -EFAULT; + + return sizeof(query); +} + static int query_engine_info(struct drm_i915_private *i915, struct drm_i915_query_item *query_item) @@ -424,6 +567,7 @@ static int (* const i915_query_funcs[])(struct drm_i915_private *dev_priv, query_topology_info, query_engine_info, query_perf_config, + query_cs_cycles, }; int i915_query_ioctl(struct drm_device *dev, void *data, struct drm_file *file) diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h index 1987e2ea79a3..abcc479e2be1 100644 --- a/include/uapi/drm/i915_drm.h +++ b/include/uapi/drm/i915_drm.h @@ -2176,6 +2176,10 @@ struct drm_i915_query_item { #define DRM_I915_QUERY_TOPOLOGY_INFO 1 #define DRM_I915_QUERY_ENGINE_INFO 2 #define DRM_I915_QUERY_PERF_CONFIG 3 + /** + * Query Command Streamer timestamp register. + */ +#define DRM_I915_QUERY_CS_CYCLES 4 /* Must be kept compact -- no holes and well documented */ /* @@ -2309,6 +2313,49 @@ struct drm_i915_engine_info { __u64 rsvd1[4]; }; +/** + * struct drm_i915_query_cs_cycles + * + * The query returns the command streamer cycles and the frequency that can be + * used to calculate the command streamer timestamp. In addition the query + * returns the cpu timestamp that indicates when the command streamer cycle + * count was captured. + */ +struct drm_i915_query_cs_cycles { + /** Engine for which command streamer cycles is queried. */ + struct i915_engine_class_instance engine; + + /** Must be zero. */ + __u32 flags; + + /** + * Command streamer cycles as read from the command streamer + * register at 0x358 offset. + */ + __u64 cs_cycles; + + /** Frequency of the cs cycles in Hz. */ + __u64 cs_frequency; + + /** + * CPU timestamp in nanoseconds. cpu_timestamp[0] is captured before + * reading the cs_cycles register and cpu_timestamp[1] is captured after + * reading the register. + **/ + __u64 cpu_timestamp[2]; + + /** + * Reference clock id for CPU timestamp. For definition, see + * clock_gettime(2) and perf_event_open(2). Supported clock ids are + * CLOCK_MONOTONIC, CLOCK_MONOTONIC_RAW, CLOCK_REALTIME, CLOCK_BOOTTIME, + * CLOCK_TAI. + */ + __s32 clockid; + + /** Must be zero. */ + __u32 rsvd; +}; + /** * struct drm_i915_query_engine_info *
Perf measurements rely on CPU and engine timestamps to correlate events of interest across these time domains. Current mechanisms get these timestamps separately and the calculated delta between these timestamps lack enough accuracy. To improve the accuracy of these time measurements to within a few us, add a query that returns the engine and cpu timestamps captured as close to each other as possible. v2: (Tvrtko) - document clock reference used - return cpu timestamp always - capture cpu time just before lower dword of cs timestamp v3: (Chris) - use uncore-rpm - use __query_cs_timestamp helper v4: (Lionel) - Kernel perf subsytem allows users to specify the clock id to be used in perf_event_open. This clock id is used by the perf subsystem to return the appropriate cpu timestamp in perf events. Similarly, let the user pass the clockid to this query so that cpu timestamp corresponds to the clock id requested. v5: (Tvrtko) - Use normal ktime accessors instead of fast versions - Add more uApi documentation v6: (Lionel) - Move switch out of spinlock v7: (Chris) - cs_timestamp is a misnomer, use cs_cycles instead - return the cs cycle frequency as well in the query v8: - Add platform and engine specific checks v9: (Lionel) - Return 2 cpu timestamps in the query - captured before and after the register read Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> --- drivers/gpu/drm/i915/i915_query.c | 144 ++++++++++++++++++++++++++++++ include/uapi/drm/i915_drm.h | 47 ++++++++++ 2 files changed, 191 insertions(+)