Message ID | 20211220120030.4116079-1-tejaskumarx.surendrakumar.upadhyay@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | drm/i915/gt: reset RING_HEAD during intel_gt_unset_wedged | expand |
On 12/20/21 13:00, Tejas Upadhyay wrote: > During repeated wedged-unwedged, it is > found that i915_request_retire zaps the old > request with 0x6b6b6b6b. > > On unwedged, we write a new request at RING_TAIL, > expecting to start executuing from that position, > but execution resumes from RING_HEAD (preserved > from an earlier wakeup before wedging) and > consumes the 0x6b. > > Resetting kernel/user context setup enables > RING_HEAD to use RING_TAIL for submitting new > requests which resolves issue. Normally this reset > is applied when unpinning a user context, or for > kernel_contexts upon waking up the device. But fast > wedged-unwedged sequence will keep the device awake, > preserving the RING_HEAD from before. > > Testcase: igt@gem_eio@unwedge-stress > > Note : Current user impact is assessed to be low, as > this only affects intel_gt_unset_wedged which is > currently only used during testing and upon suspend > resume (where the device was already flushed and will > reset the kernel_contexts on waking up). In the > future though, this will present an issue for PCI > error recovery. > > Signed-off-by: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com> > --- > drivers/gpu/drm/i915/gt/intel_engine_cs.c | 8 ++++++++ > drivers/gpu/drm/i915/gt/intel_reset.c | 3 +++ > 2 files changed, 11 insertions(+) > > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c > index 352254e001b4..7e1c561bce69 100644 > --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c > +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c > @@ -1455,9 +1455,17 @@ void intel_engines_reset_default_submission(struct intel_gt *gt) > enum intel_engine_id id; > > for_each_engine(engine, gt, id) { > + struct intel_context *ce = engine->kernel_context; > + > if (engine->sanitize) > engine->sanitize(engine); > > + /* Reset RING_HEAD so we don't consume the old > + * poisoned request on unwedging > + */ > + if (ce) > + ce->ops->reset(ce); > + Would it make sense to instead stop excluding the kernel context in intel_engine_reset_pinned_contexts, which is called from engine->sanitize(). That would lead to a double kernel context reset in some situations, though, but I figure the above would as well if the engine was parked before unwedging? How is the PCI error recovery affected? Do we have have a plan to address that? Thanks, /Thomas > engine->set_default_submission(engine); > } > } > diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c > index 7be0002d9d70..1c26e936e699 100644 > --- a/drivers/gpu/drm/i915/gt/intel_reset.c > +++ b/drivers/gpu/drm/i915/gt/intel_reset.c > @@ -961,6 +961,9 @@ static bool __intel_gt_unset_wedged(struct intel_gt *gt) > } > spin_unlock(&timelines->lock); > > + /* Ensure that all non-kernel contexts are unpinned as well */ > + intel_gt_retire_requests(gt); > + > /* We must reset pending GPU events before restoring our submission */ > ok = !HAS_EXECLISTS(gt->i915); /* XXX better agnosticism desired */ > if (!INTEL_INFO(gt->i915)->gpu_reset_clobbers_display)
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c index 352254e001b4..7e1c561bce69 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c @@ -1455,9 +1455,17 @@ void intel_engines_reset_default_submission(struct intel_gt *gt) enum intel_engine_id id; for_each_engine(engine, gt, id) { + struct intel_context *ce = engine->kernel_context; + if (engine->sanitize) engine->sanitize(engine); + /* Reset RING_HEAD so we don't consume the old + * poisoned request on unwedging + */ + if (ce) + ce->ops->reset(ce); + engine->set_default_submission(engine); } } diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index 7be0002d9d70..1c26e936e699 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -961,6 +961,9 @@ static bool __intel_gt_unset_wedged(struct intel_gt *gt) } spin_unlock(&timelines->lock); + /* Ensure that all non-kernel contexts are unpinned as well */ + intel_gt_retire_requests(gt); + /* We must reset pending GPU events before restoring our submission */ ok = !HAS_EXECLISTS(gt->i915); /* XXX better agnosticism desired */ if (!INTEL_INFO(gt->i915)->gpu_reset_clobbers_display)
During repeated wedged-unwedged, it is found that i915_request_retire zaps the old request with 0x6b6b6b6b. On unwedged, we write a new request at RING_TAIL, expecting to start executuing from that position, but execution resumes from RING_HEAD (preserved from an earlier wakeup before wedging) and consumes the 0x6b. Resetting kernel/user context setup enables RING_HEAD to use RING_TAIL for submitting new requests which resolves issue. Normally this reset is applied when unpinning a user context, or for kernel_contexts upon waking up the device. But fast wedged-unwedged sequence will keep the device awake, preserving the RING_HEAD from before. Testcase: igt@gem_eio@unwedge-stress Note : Current user impact is assessed to be low, as this only affects intel_gt_unset_wedged which is currently only used during testing and upon suspend resume (where the device was already flushed and will reset the kernel_contexts on waking up). In the future though, this will present an issue for PCI error recovery. Signed-off-by: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com> --- drivers/gpu/drm/i915/gt/intel_engine_cs.c | 8 ++++++++ drivers/gpu/drm/i915/gt/intel_reset.c | 3 +++ 2 files changed, 11 insertions(+)