Message ID | 20190627080045.8814-9-lionel.g.landwerlin@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | drm/i915: Vulkan performance query support | expand |
Quoting Lionel Landwerlin (2019-06-27 09:00:43) > We would like to make use of perf in Vulkan. The Vulkan API is much > lower level than OpenGL, with applications directly exposed to the > concept of command buffers (pretty much equivalent to our batch > buffers). In Vulkan, queries are always limited in scope to a command > buffer. In OpenGL, the lack of command buffer concept meant that > queries' duration could span multiple command buffers. > > With that restriction gone in Vulkan, we would like to simplify > measuring performance just by measuring the deltas between the counter > snapshots written by 2 MI_RECORD_PERF_COUNT commands, rather than the > more complex scheme we currently have in the GL driver, using 2 > MI_RECORD_PERF_COUNT commands and doing some post processing on the > stream of OA reports, coming from the global OA buffer, to remove any > unrelated deltas in between the 2 MI_RECORD_PERF_COUNT. > > Disabling preemption only apply to a single context with which want to > query performance counters for and is considered a privileged > operation, by default protected by CAP_SYS_ADMIN. It is possible to > enable it for a normal user by disabling the paranoid stream setting. > > v2: Store preemption setting in intel_context (Chris) > > v3: Use priorities to avoid preemption rather than the HW mechanism > > v4: Just modify the port priority reporting function > > Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> > --- > drivers/gpu/drm/i915/gt/intel_lrc.c | 7 ++++++- > drivers/gpu/drm/i915/i915_drv.c | 2 +- > drivers/gpu/drm/i915/i915_drv.h | 13 ++++++++++++ > drivers/gpu/drm/i915/i915_perf.c | 22 +++++++++++++++++++-- > drivers/gpu/drm/i915/i915_priolist_types.h | 7 +++++++ > drivers/gpu/drm/i915/i915_request.c | 1 + > drivers/gpu/drm/i915/i915_request.h | 1 + > drivers/gpu/drm/i915/intel_guc_submission.c | 12 ++++++++++- > include/uapi/drm/i915_drm.h | 11 +++++++++++ > 9 files changed, 71 insertions(+), 5 deletions(-) > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c > index a6ea468986d1..52f4d69cdb2f 100644 > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c > @@ -256,7 +256,12 @@ static inline int rq_prio(const struct i915_request *rq) > > static int effective_prio(const struct i915_request *rq) > { > - int prio = rq_prio(rq); > + int prio; > + > + if (rq->has_perf) > + prio = I915_USER_PRIORITY(I915_PRIORITY_PERF); > + else > + prio = rq_prio(rq); This doesn't cover everything. You will get timesliced now. But I do think this is the safest approach that evades all the PI sanity checks and possible deadlocks. There is also the problem where we check the priority on the last request in the queue, but preemption may need to be disabled at the head of the queue (the executing request). Is that a concern? I was thinking of marking the context as non-preemptible until after the oa is removed. > /* > * On unwinding the active request, we give it a priority bump > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c > index ef54cf1c54a5..68e29b1ed1c8 100644 > --- a/drivers/gpu/drm/i915/i915_drv.c > +++ b/drivers/gpu/drm/i915/i915_drv.c > @@ -484,7 +484,7 @@ static int i915_getparam_ioctl(struct drm_device *dev, void *data, > value = INTEL_INFO(dev_priv)->has_coherent_ggtt; > break; > case I915_PARAM_PERF_REVISION: > - value = 1; > + value = 2; > break; > case I915_PARAM_HAS_EXEC_PERF_CONFIG: > /* Obviously requires perf support. */ > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h > index 51b103493623..f1b8a0ada986 100644 > --- a/drivers/gpu/drm/i915/i915_drv.h > +++ b/drivers/gpu/drm/i915/i915_drv.h > @@ -1232,6 +1232,19 @@ struct i915_perf_stream { > */ > bool enabled; > > + /** > + * @hold_preemption: Whether preemption is put on hold for command > + * submissions done on the @ctx. This is useful for some drivers that > + * cannot easily post process the OA buffer context to subtract delta > + * of performance counters not associated with @ctx. > + */ > + bool hold_preemption; > + > + /** > + * @last_config_request > + */ > + struct i915_request *last_config_request; > + > /** > * @ops: The callbacks providing the implementation of this specific > * type of configured stream. > diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c > index 34d74fa64c74..bf4f5fee6764 100644 > --- a/drivers/gpu/drm/i915/i915_perf.c > +++ b/drivers/gpu/drm/i915/i915_perf.c > @@ -343,6 +343,8 @@ static const struct i915_oa_format gen8_plus_oa_formats[I915_OA_FORMAT_MAX] = { > * struct perf_open_properties - for validated properties given to open a stream > * @sample_flags: `DRM_I915_PERF_PROP_SAMPLE_*` properties are tracked as flags > * @single_context: Whether a single or all gpu contexts should be monitored > + * @hold_preemption: Whether the preemption is disabled for the filtered > + * context > * @ctx_handle: A gem ctx handle for use with @single_context > * @metrics_set: An ID for an OA unit metric set advertised via sysfs > * @oa_format: An OA unit HW report format > @@ -357,6 +359,7 @@ struct perf_open_properties { > u32 sample_flags; > > u64 single_context:1; > + u64 hold_preemption:1; > u64 ctx_handle; > > /* OA sampling state */ > @@ -2352,6 +2355,8 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream, > stream->sample_flags |= SAMPLE_OA_REPORT; > stream->sample_size += format_size; > > + stream->hold_preemption = props->hold_preemption; > + > dev_priv->perf.oa.oa_buffer.format_size = format_size; > if (WARN_ON(dev_priv->perf.oa.oa_buffer.format_size == 0)) > return -EINVAL; > @@ -2890,6 +2895,15 @@ i915_perf_open_ioctl_locked(struct drm_i915_private *dev_priv, > } > } > > + if (props->hold_preemption) { > + if (!props->single_context) { > + DRM_DEBUG("preemption disable with no context\n"); > + ret = -EINVAL; > + goto err; > + } > + privileged_op = true; > + } > + > /* > * On Haswell the OA unit supports clock gating off for a specific > * context and in this mode there's no visibility of metrics for the > @@ -2904,8 +2918,9 @@ i915_perf_open_ioctl_locked(struct drm_i915_private *dev_priv, > * MI_REPORT_PERF_COUNT commands and so consider it a privileged op to > * enable the OA unit by default. > */ > - if (IS_HASWELL(dev_priv) && specific_ctx) > + if (IS_HASWELL(dev_priv) && specific_ctx && !props->hold_preemption) { > privileged_op = false; > + } > > /* Similar to perf's kernel.perf_paranoid_cpu sysctl option > * we check a dev.i915.perf_stream_paranoid sysctl option > @@ -2914,7 +2929,7 @@ i915_perf_open_ioctl_locked(struct drm_i915_private *dev_priv, > */ > if (privileged_op && > i915_perf_stream_paranoid && !capable(CAP_SYS_ADMIN)) { > - DRM_DEBUG("Insufficient privileges to open system-wide i915 perf stream\n"); > + DRM_DEBUG("Insufficient privileges to open i915 perf stream\n"); > ret = -EACCES; > goto err_ctx; > } > @@ -3106,6 +3121,9 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv, > props->oa_periodic = true; > props->oa_period_exponent = value; > break; > + case DRM_I915_PERF_PROP_HOLD_PREEMPTION: > + props->hold_preemption = value != 0 ? 1 : 0; !!value. > + break; > case DRM_I915_PERF_PROP_MAX: > MISSING_CASE(id); > return -EINVAL; > diff --git a/drivers/gpu/drm/i915/i915_priolist_types.h b/drivers/gpu/drm/i915/i915_priolist_types.h > index 49709de69875..fc1c1fe7ce58 100644 > --- a/drivers/gpu/drm/i915/i915_priolist_types.h > +++ b/drivers/gpu/drm/i915/i915_priolist_types.h > @@ -17,6 +17,13 @@ enum { > I915_PRIORITY_NORMAL = I915_CONTEXT_DEFAULT_PRIORITY, > I915_PRIORITY_MAX = I915_CONTEXT_MAX_USER_PRIORITY + 1, > > + /* Requests containing performance queries must not be preempted by > + * another context. They get scheduled with their default priority and > + * once they reach the execlist ports we bump them to > + * I915_PRIORITY_PERF so that they stick to the HW until they finish. > + */ > + I915_PRIORITY_PERF = I915_CONTEXT_MAX_USER_PRIORITY + 2, Just let it be next, one above max? > + > I915_PRIORITY_INVALID = INT_MIN > }; > > diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c > index 5ff87c4a0cd5..44b4b2bfedf9 100644 > --- a/drivers/gpu/drm/i915/i915_request.c > +++ b/drivers/gpu/drm/i915/i915_request.c > @@ -685,6 +685,7 @@ __i915_request_create(struct intel_context *ce, gfp_t gfp) > rq->batch = NULL; > rq->capture_list = NULL; > rq->waitboost = false; > + rq->has_perf = false; > rq->execution_mask = ALL_ENGINES; > > INIT_LIST_HEAD(&rq->active_list); > diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h > index b58ceef92e20..21be6b44602d 100644 > --- a/drivers/gpu/drm/i915/i915_request.h > +++ b/drivers/gpu/drm/i915/i915_request.h > @@ -217,6 +217,7 @@ struct i915_request { > unsigned long emitted_jiffies; > > bool waitboost; > + bool has_perf; 2 bools, probably time for a flags if we stick with this approach. Also note previous patch didn't compile but rq->perf was out of sequence. -Chris
On 27/06/2019 12:53, Chris Wilson wrote: > Quoting Lionel Landwerlin (2019-06-27 09:00:43) >> We would like to make use of perf in Vulkan. The Vulkan API is much >> lower level than OpenGL, with applications directly exposed to the >> concept of command buffers (pretty much equivalent to our batch >> buffers). In Vulkan, queries are always limited in scope to a command >> buffer. In OpenGL, the lack of command buffer concept meant that >> queries' duration could span multiple command buffers. >> >> With that restriction gone in Vulkan, we would like to simplify >> measuring performance just by measuring the deltas between the counter >> snapshots written by 2 MI_RECORD_PERF_COUNT commands, rather than the >> more complex scheme we currently have in the GL driver, using 2 >> MI_RECORD_PERF_COUNT commands and doing some post processing on the >> stream of OA reports, coming from the global OA buffer, to remove any >> unrelated deltas in between the 2 MI_RECORD_PERF_COUNT. >> >> Disabling preemption only apply to a single context with which want to >> query performance counters for and is considered a privileged >> operation, by default protected by CAP_SYS_ADMIN. It is possible to >> enable it for a normal user by disabling the paranoid stream setting. >> >> v2: Store preemption setting in intel_context (Chris) >> >> v3: Use priorities to avoid preemption rather than the HW mechanism >> >> v4: Just modify the port priority reporting function >> >> Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> >> --- >> drivers/gpu/drm/i915/gt/intel_lrc.c | 7 ++++++- >> drivers/gpu/drm/i915/i915_drv.c | 2 +- >> drivers/gpu/drm/i915/i915_drv.h | 13 ++++++++++++ >> drivers/gpu/drm/i915/i915_perf.c | 22 +++++++++++++++++++-- >> drivers/gpu/drm/i915/i915_priolist_types.h | 7 +++++++ >> drivers/gpu/drm/i915/i915_request.c | 1 + >> drivers/gpu/drm/i915/i915_request.h | 1 + >> drivers/gpu/drm/i915/intel_guc_submission.c | 12 ++++++++++- >> include/uapi/drm/i915_drm.h | 11 +++++++++++ >> 9 files changed, 71 insertions(+), 5 deletions(-) >> >> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c >> index a6ea468986d1..52f4d69cdb2f 100644 >> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c >> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c >> @@ -256,7 +256,12 @@ static inline int rq_prio(const struct i915_request *rq) >> >> static int effective_prio(const struct i915_request *rq) >> { >> - int prio = rq_prio(rq); >> + int prio; >> + >> + if (rq->has_perf) >> + prio = I915_USER_PRIORITY(I915_PRIORITY_PERF); >> + else >> + prio = rq_prio(rq); > This doesn't cover everything. You will get timesliced now. > > But I do think this is the safest approach that evades all the PI sanity > checks and possible deadlocks. > > There is also the problem where we check the priority on the last > request in the queue, but preemption may need to be disabled at the head > of the queue (the executing request). Is that a concern? I was thinking > of marking the context as non-preemptible until after the oa is removed. What do you mean by timesliced? I don't really have a good way of working around the global counters so whatever you think is best to allow queries in batch buffers to complete. > >> /* >> * On unwinding the active request, we give it a priority bump >> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c >> index ef54cf1c54a5..68e29b1ed1c8 100644 >> --- a/drivers/gpu/drm/i915/i915_drv.c >> +++ b/drivers/gpu/drm/i915/i915_drv.c >> @@ -484,7 +484,7 @@ static int i915_getparam_ioctl(struct drm_device *dev, void *data, >> value = INTEL_INFO(dev_priv)->has_coherent_ggtt; >> break; >> case I915_PARAM_PERF_REVISION: >> - value = 1; >> + value = 2; >> break; >> case I915_PARAM_HAS_EXEC_PERF_CONFIG: >> /* Obviously requires perf support. */ >> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h >> index 51b103493623..f1b8a0ada986 100644 >> --- a/drivers/gpu/drm/i915/i915_drv.h >> +++ b/drivers/gpu/drm/i915/i915_drv.h >> @@ -1232,6 +1232,19 @@ struct i915_perf_stream { >> */ >> bool enabled; >> >> + /** >> + * @hold_preemption: Whether preemption is put on hold for command >> + * submissions done on the @ctx. This is useful for some drivers that >> + * cannot easily post process the OA buffer context to subtract delta >> + * of performance counters not associated with @ctx. >> + */ >> + bool hold_preemption; >> + >> + /** >> + * @last_config_request >> + */ >> + struct i915_request *last_config_request; >> + >> /** >> * @ops: The callbacks providing the implementation of this specific >> * type of configured stream. >> diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c >> index 34d74fa64c74..bf4f5fee6764 100644 >> --- a/drivers/gpu/drm/i915/i915_perf.c >> +++ b/drivers/gpu/drm/i915/i915_perf.c >> @@ -343,6 +343,8 @@ static const struct i915_oa_format gen8_plus_oa_formats[I915_OA_FORMAT_MAX] = { >> * struct perf_open_properties - for validated properties given to open a stream >> * @sample_flags: `DRM_I915_PERF_PROP_SAMPLE_*` properties are tracked as flags >> * @single_context: Whether a single or all gpu contexts should be monitored >> + * @hold_preemption: Whether the preemption is disabled for the filtered >> + * context >> * @ctx_handle: A gem ctx handle for use with @single_context >> * @metrics_set: An ID for an OA unit metric set advertised via sysfs >> * @oa_format: An OA unit HW report format >> @@ -357,6 +359,7 @@ struct perf_open_properties { >> u32 sample_flags; >> >> u64 single_context:1; >> + u64 hold_preemption:1; >> u64 ctx_handle; >> >> /* OA sampling state */ >> @@ -2352,6 +2355,8 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream, >> stream->sample_flags |= SAMPLE_OA_REPORT; >> stream->sample_size += format_size; >> >> + stream->hold_preemption = props->hold_preemption; >> + >> dev_priv->perf.oa.oa_buffer.format_size = format_size; >> if (WARN_ON(dev_priv->perf.oa.oa_buffer.format_size == 0)) >> return -EINVAL; >> @@ -2890,6 +2895,15 @@ i915_perf_open_ioctl_locked(struct drm_i915_private *dev_priv, >> } >> } >> >> + if (props->hold_preemption) { >> + if (!props->single_context) { >> + DRM_DEBUG("preemption disable with no context\n"); >> + ret = -EINVAL; >> + goto err; >> + } >> + privileged_op = true; >> + } >> + >> /* >> * On Haswell the OA unit supports clock gating off for a specific >> * context and in this mode there's no visibility of metrics for the >> @@ -2904,8 +2918,9 @@ i915_perf_open_ioctl_locked(struct drm_i915_private *dev_priv, >> * MI_REPORT_PERF_COUNT commands and so consider it a privileged op to >> * enable the OA unit by default. >> */ >> - if (IS_HASWELL(dev_priv) && specific_ctx) >> + if (IS_HASWELL(dev_priv) && specific_ctx && !props->hold_preemption) { >> privileged_op = false; >> + } >> >> /* Similar to perf's kernel.perf_paranoid_cpu sysctl option >> * we check a dev.i915.perf_stream_paranoid sysctl option >> @@ -2914,7 +2929,7 @@ i915_perf_open_ioctl_locked(struct drm_i915_private *dev_priv, >> */ >> if (privileged_op && >> i915_perf_stream_paranoid && !capable(CAP_SYS_ADMIN)) { >> - DRM_DEBUG("Insufficient privileges to open system-wide i915 perf stream\n"); >> + DRM_DEBUG("Insufficient privileges to open i915 perf stream\n"); >> ret = -EACCES; >> goto err_ctx; >> } >> @@ -3106,6 +3121,9 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv, >> props->oa_periodic = true; >> props->oa_period_exponent = value; >> break; >> + case DRM_I915_PERF_PROP_HOLD_PREEMPTION: >> + props->hold_preemption = value != 0 ? 1 : 0; > !!value. Done. > >> + break; >> case DRM_I915_PERF_PROP_MAX: >> MISSING_CASE(id); >> return -EINVAL; >> diff --git a/drivers/gpu/drm/i915/i915_priolist_types.h b/drivers/gpu/drm/i915/i915_priolist_types.h >> index 49709de69875..fc1c1fe7ce58 100644 >> --- a/drivers/gpu/drm/i915/i915_priolist_types.h >> +++ b/drivers/gpu/drm/i915/i915_priolist_types.h >> @@ -17,6 +17,13 @@ enum { >> I915_PRIORITY_NORMAL = I915_CONTEXT_DEFAULT_PRIORITY, >> I915_PRIORITY_MAX = I915_CONTEXT_MAX_USER_PRIORITY + 1, >> >> + /* Requests containing performance queries must not be preempted by >> + * another context. They get scheduled with their default priority and >> + * once they reach the execlist ports we bump them to >> + * I915_PRIORITY_PERF so that they stick to the HW until they finish. >> + */ >> + I915_PRIORITY_PERF = I915_CONTEXT_MAX_USER_PRIORITY + 2, > Just let it be next, one above max? Done. > >> + >> I915_PRIORITY_INVALID = INT_MIN >> }; >> >> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c >> index 5ff87c4a0cd5..44b4b2bfedf9 100644 >> --- a/drivers/gpu/drm/i915/i915_request.c >> +++ b/drivers/gpu/drm/i915/i915_request.c >> @@ -685,6 +685,7 @@ __i915_request_create(struct intel_context *ce, gfp_t gfp) >> rq->batch = NULL; >> rq->capture_list = NULL; >> rq->waitboost = false; >> + rq->has_perf = false; >> rq->execution_mask = ALL_ENGINES; >> >> INIT_LIST_HEAD(&rq->active_list); >> diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h >> index b58ceef92e20..21be6b44602d 100644 >> --- a/drivers/gpu/drm/i915/i915_request.h >> +++ b/drivers/gpu/drm/i915/i915_request.h >> @@ -217,6 +217,7 @@ struct i915_request { >> unsigned long emitted_jiffies; >> >> bool waitboost; >> + bool has_perf; > 2 bools, probably time for a flags if we stick with this approach. > > Also note previous patch didn't compile but rq->perf was out of > sequence. > -Chris > Done.
Quoting Lionel Landwerlin (2019-06-27 15:27:40) > On 27/06/2019 12:53, Chris Wilson wrote: > > Quoting Lionel Landwerlin (2019-06-27 09:00:43) > >> We would like to make use of perf in Vulkan. The Vulkan API is much > >> lower level than OpenGL, with applications directly exposed to the > >> concept of command buffers (pretty much equivalent to our batch > >> buffers). In Vulkan, queries are always limited in scope to a command > >> buffer. In OpenGL, the lack of command buffer concept meant that > >> queries' duration could span multiple command buffers. > >> > >> With that restriction gone in Vulkan, we would like to simplify > >> measuring performance just by measuring the deltas between the counter > >> snapshots written by 2 MI_RECORD_PERF_COUNT commands, rather than the > >> more complex scheme we currently have in the GL driver, using 2 > >> MI_RECORD_PERF_COUNT commands and doing some post processing on the > >> stream of OA reports, coming from the global OA buffer, to remove any > >> unrelated deltas in between the 2 MI_RECORD_PERF_COUNT. > >> > >> Disabling preemption only apply to a single context with which want to > >> query performance counters for and is considered a privileged > >> operation, by default protected by CAP_SYS_ADMIN. It is possible to > >> enable it for a normal user by disabling the paranoid stream setting. > >> > >> v2: Store preemption setting in intel_context (Chris) > >> > >> v3: Use priorities to avoid preemption rather than the HW mechanism > >> > >> v4: Just modify the port priority reporting function > >> > >> Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> > >> --- > >> drivers/gpu/drm/i915/gt/intel_lrc.c | 7 ++++++- > >> drivers/gpu/drm/i915/i915_drv.c | 2 +- > >> drivers/gpu/drm/i915/i915_drv.h | 13 ++++++++++++ > >> drivers/gpu/drm/i915/i915_perf.c | 22 +++++++++++++++++++-- > >> drivers/gpu/drm/i915/i915_priolist_types.h | 7 +++++++ > >> drivers/gpu/drm/i915/i915_request.c | 1 + > >> drivers/gpu/drm/i915/i915_request.h | 1 + > >> drivers/gpu/drm/i915/intel_guc_submission.c | 12 ++++++++++- > >> include/uapi/drm/i915_drm.h | 11 +++++++++++ > >> 9 files changed, 71 insertions(+), 5 deletions(-) > >> > >> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c > >> index a6ea468986d1..52f4d69cdb2f 100644 > >> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c > >> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c > >> @@ -256,7 +256,12 @@ static inline int rq_prio(const struct i915_request *rq) > >> > >> static int effective_prio(const struct i915_request *rq) > >> { > >> - int prio = rq_prio(rq); > >> + int prio; > >> + > >> + if (rq->has_perf) > >> + prio = I915_USER_PRIORITY(I915_PRIORITY_PERF); > >> + else > >> + prio = rq_prio(rq); > > This doesn't cover everything. You will get timesliced now. > > > > But I do think this is the safest approach that evades all the PI sanity > > checks and possible deadlocks. > > > > There is also the problem where we check the priority on the last > > request in the queue, but preemption may need to be disabled at the head > > of the queue (the executing request). Is that a concern? I was thinking > > of marking the context as non-preemptible until after the oa is removed. > > > What do you mean by timesliced? Actually just sent the patch that would allow this to apply to the other preemption path. > I don't really have a good way of working around the global counters so > whatever you think is best to allow queries in batch buffers to complete. Gut feeling is that if you need preemption protection, it should be on the context. Otherwise it seems easy to create a situation where the request being tracked for the context submission doesn't have the oa flag, but is masking one that does. -Chris
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c index a6ea468986d1..52f4d69cdb2f 100644 --- a/drivers/gpu/drm/i915/gt/intel_lrc.c +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c @@ -256,7 +256,12 @@ static inline int rq_prio(const struct i915_request *rq) static int effective_prio(const struct i915_request *rq) { - int prio = rq_prio(rq); + int prio; + + if (rq->has_perf) + prio = I915_USER_PRIORITY(I915_PRIORITY_PERF); + else + prio = rq_prio(rq); /* * On unwinding the active request, we give it a priority bump diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c index ef54cf1c54a5..68e29b1ed1c8 100644 --- a/drivers/gpu/drm/i915/i915_drv.c +++ b/drivers/gpu/drm/i915/i915_drv.c @@ -484,7 +484,7 @@ static int i915_getparam_ioctl(struct drm_device *dev, void *data, value = INTEL_INFO(dev_priv)->has_coherent_ggtt; break; case I915_PARAM_PERF_REVISION: - value = 1; + value = 2; break; case I915_PARAM_HAS_EXEC_PERF_CONFIG: /* Obviously requires perf support. */ diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index 51b103493623..f1b8a0ada986 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -1232,6 +1232,19 @@ struct i915_perf_stream { */ bool enabled; + /** + * @hold_preemption: Whether preemption is put on hold for command + * submissions done on the @ctx. This is useful for some drivers that + * cannot easily post process the OA buffer context to subtract delta + * of performance counters not associated with @ctx. + */ + bool hold_preemption; + + /** + * @last_config_request + */ + struct i915_request *last_config_request; + /** * @ops: The callbacks providing the implementation of this specific * type of configured stream. diff --git a/drivers/gpu/drm/i915/i915_perf.c b/drivers/gpu/drm/i915/i915_perf.c index 34d74fa64c74..bf4f5fee6764 100644 --- a/drivers/gpu/drm/i915/i915_perf.c +++ b/drivers/gpu/drm/i915/i915_perf.c @@ -343,6 +343,8 @@ static const struct i915_oa_format gen8_plus_oa_formats[I915_OA_FORMAT_MAX] = { * struct perf_open_properties - for validated properties given to open a stream * @sample_flags: `DRM_I915_PERF_PROP_SAMPLE_*` properties are tracked as flags * @single_context: Whether a single or all gpu contexts should be monitored + * @hold_preemption: Whether the preemption is disabled for the filtered + * context * @ctx_handle: A gem ctx handle for use with @single_context * @metrics_set: An ID for an OA unit metric set advertised via sysfs * @oa_format: An OA unit HW report format @@ -357,6 +359,7 @@ struct perf_open_properties { u32 sample_flags; u64 single_context:1; + u64 hold_preemption:1; u64 ctx_handle; /* OA sampling state */ @@ -2352,6 +2355,8 @@ static int i915_oa_stream_init(struct i915_perf_stream *stream, stream->sample_flags |= SAMPLE_OA_REPORT; stream->sample_size += format_size; + stream->hold_preemption = props->hold_preemption; + dev_priv->perf.oa.oa_buffer.format_size = format_size; if (WARN_ON(dev_priv->perf.oa.oa_buffer.format_size == 0)) return -EINVAL; @@ -2890,6 +2895,15 @@ i915_perf_open_ioctl_locked(struct drm_i915_private *dev_priv, } } + if (props->hold_preemption) { + if (!props->single_context) { + DRM_DEBUG("preemption disable with no context\n"); + ret = -EINVAL; + goto err; + } + privileged_op = true; + } + /* * On Haswell the OA unit supports clock gating off for a specific * context and in this mode there's no visibility of metrics for the @@ -2904,8 +2918,9 @@ i915_perf_open_ioctl_locked(struct drm_i915_private *dev_priv, * MI_REPORT_PERF_COUNT commands and so consider it a privileged op to * enable the OA unit by default. */ - if (IS_HASWELL(dev_priv) && specific_ctx) + if (IS_HASWELL(dev_priv) && specific_ctx && !props->hold_preemption) { privileged_op = false; + } /* Similar to perf's kernel.perf_paranoid_cpu sysctl option * we check a dev.i915.perf_stream_paranoid sysctl option @@ -2914,7 +2929,7 @@ i915_perf_open_ioctl_locked(struct drm_i915_private *dev_priv, */ if (privileged_op && i915_perf_stream_paranoid && !capable(CAP_SYS_ADMIN)) { - DRM_DEBUG("Insufficient privileges to open system-wide i915 perf stream\n"); + DRM_DEBUG("Insufficient privileges to open i915 perf stream\n"); ret = -EACCES; goto err_ctx; } @@ -3106,6 +3121,9 @@ static int read_properties_unlocked(struct drm_i915_private *dev_priv, props->oa_periodic = true; props->oa_period_exponent = value; break; + case DRM_I915_PERF_PROP_HOLD_PREEMPTION: + props->hold_preemption = value != 0 ? 1 : 0; + break; case DRM_I915_PERF_PROP_MAX: MISSING_CASE(id); return -EINVAL; diff --git a/drivers/gpu/drm/i915/i915_priolist_types.h b/drivers/gpu/drm/i915/i915_priolist_types.h index 49709de69875..fc1c1fe7ce58 100644 --- a/drivers/gpu/drm/i915/i915_priolist_types.h +++ b/drivers/gpu/drm/i915/i915_priolist_types.h @@ -17,6 +17,13 @@ enum { I915_PRIORITY_NORMAL = I915_CONTEXT_DEFAULT_PRIORITY, I915_PRIORITY_MAX = I915_CONTEXT_MAX_USER_PRIORITY + 1, + /* Requests containing performance queries must not be preempted by + * another context. They get scheduled with their default priority and + * once they reach the execlist ports we bump them to + * I915_PRIORITY_PERF so that they stick to the HW until they finish. + */ + I915_PRIORITY_PERF = I915_CONTEXT_MAX_USER_PRIORITY + 2, + I915_PRIORITY_INVALID = INT_MIN }; diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c index 5ff87c4a0cd5..44b4b2bfedf9 100644 --- a/drivers/gpu/drm/i915/i915_request.c +++ b/drivers/gpu/drm/i915/i915_request.c @@ -685,6 +685,7 @@ __i915_request_create(struct intel_context *ce, gfp_t gfp) rq->batch = NULL; rq->capture_list = NULL; rq->waitboost = false; + rq->has_perf = false; rq->execution_mask = ALL_ENGINES; INIT_LIST_HEAD(&rq->active_list); diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h index b58ceef92e20..21be6b44602d 100644 --- a/drivers/gpu/drm/i915/i915_request.h +++ b/drivers/gpu/drm/i915/i915_request.h @@ -217,6 +217,7 @@ struct i915_request { unsigned long emitted_jiffies; bool waitboost; + bool has_perf; /** timeline->request entry for this request */ struct list_head link; diff --git a/drivers/gpu/drm/i915/intel_guc_submission.c b/drivers/gpu/drm/i915/intel_guc_submission.c index 12c22359fdac..341e303ef0a3 100644 --- a/drivers/gpu/drm/i915/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/intel_guc_submission.c @@ -707,6 +707,16 @@ static inline int rq_prio(const struct i915_request *rq) return rq->sched.attr.priority | __NO_PREEMPTION; } +static inline int effective_prio(const struct i915_request *rq) +{ + int prio; + + if (rq->has_perf) + return I915_USER_PRIORITY(I915_PRIORITY_PERF) | __NO_PREEMPTION; + + return rq_prio(rq); +} + static struct i915_request *schedule_in(struct i915_request *rq, int idx) { trace_i915_request_in(rq, idx); @@ -747,7 +757,7 @@ static void __guc_dequeue(struct intel_engine_cs *engine) &engine->i915->guc.preempt_work[engine->id]; int prio = execlists->queue_priority_hint; - if (i915_scheduler_need_preempt(prio, rq_prio(last))) { + if (i915_scheduler_need_preempt(prio, effective_prio(last))) { intel_write_status_page(engine, I915_GEM_HWS_PREEMPT, GUC_PREEMPT_INPROGRESS); diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h index 5938a73963fe..e3c6c72193ed 100644 --- a/include/uapi/drm/i915_drm.h +++ b/include/uapi/drm/i915_drm.h @@ -1982,6 +1982,17 @@ enum drm_i915_perf_property_id { */ DRM_I915_PERF_PROP_OA_EXPONENT, + /** + * Specifying this property is only valid when specify a context to + * filter with DRM_I915_PERF_PROP_CTX_HANDLE. Specifying this property + * will hold preemption of the particular context we want to gather + * performance data about. The execbuf2 submissions must include a + * drm_i915_gem_execbuffer_ext_perf parameter for this to apply. + * + * This property is available in perf revision 2. + */ + DRM_I915_PERF_PROP_HOLD_PREEMPTION, + DRM_I915_PERF_PROP_MAX /* non-ABI */ };
We would like to make use of perf in Vulkan. The Vulkan API is much lower level than OpenGL, with applications directly exposed to the concept of command buffers (pretty much equivalent to our batch buffers). In Vulkan, queries are always limited in scope to a command buffer. In OpenGL, the lack of command buffer concept meant that queries' duration could span multiple command buffers. With that restriction gone in Vulkan, we would like to simplify measuring performance just by measuring the deltas between the counter snapshots written by 2 MI_RECORD_PERF_COUNT commands, rather than the more complex scheme we currently have in the GL driver, using 2 MI_RECORD_PERF_COUNT commands and doing some post processing on the stream of OA reports, coming from the global OA buffer, to remove any unrelated deltas in between the 2 MI_RECORD_PERF_COUNT. Disabling preemption only apply to a single context with which want to query performance counters for and is considered a privileged operation, by default protected by CAP_SYS_ADMIN. It is possible to enable it for a normal user by disabling the paranoid stream setting. v2: Store preemption setting in intel_context (Chris) v3: Use priorities to avoid preemption rather than the HW mechanism v4: Just modify the port priority reporting function Signed-off-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> --- drivers/gpu/drm/i915/gt/intel_lrc.c | 7 ++++++- drivers/gpu/drm/i915/i915_drv.c | 2 +- drivers/gpu/drm/i915/i915_drv.h | 13 ++++++++++++ drivers/gpu/drm/i915/i915_perf.c | 22 +++++++++++++++++++-- drivers/gpu/drm/i915/i915_priolist_types.h | 7 +++++++ drivers/gpu/drm/i915/i915_request.c | 1 + drivers/gpu/drm/i915/i915_request.h | 1 + drivers/gpu/drm/i915/intel_guc_submission.c | 12 ++++++++++- include/uapi/drm/i915_drm.h | 11 +++++++++++ 9 files changed, 71 insertions(+), 5 deletions(-)