Message ID | 20170418202335.35232-15-michel.thierry@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Tue, Apr 18, 2017 at 01:23:29PM -0700, Michel Thierry wrote: > This patch adds per engine reset and recovery (TDR) support when GuC is > used to submit workloads to GPU. > > In the case of i915 directly submission to ELSP, driver manages hang > detection, recovery and resubmission. With GuC submission these tasks > are shared between driver and GuC. i915 is still responsible for detecting > a hang, and when it does it only requests GuC to reset that Engine. GuC > internally manages acquiring forcewake and idling the engine before actually > resetting it. > > Once the reset is successful, i915 takes over again and handles resubmission. > The scheduler in i915 knows which requests are pending so after resetting > a engine, pending workloads/requests are resubmitted again. > > v2: s/i915_guc_request_engine_reset/i915_guc_reset_engine/ to match the > non-guc funtion names. > > Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com> > Signed-off-by: Jeff McGee <jeff.mcgee@intel.com> > Signed-off-by: Michel Thierry <michel.thierry@intel.com> > --- > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c > index 7df278fe492e..6295760098a1 100644 > --- a/drivers/gpu/drm/i915/intel_lrc.c > +++ b/drivers/gpu/drm/i915/intel_lrc.c > @@ -1176,14 +1176,15 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine) > > /* After a GPU reset, we may have requests to replay */ > clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted); > - if (!i915.enable_guc_submission && !execlists_elsp_idle(engine)) { > + if (!execlists_elsp_idle(engine)) { > DRM_DEBUG_DRIVER("Restarting %s from requests [0x%x, 0x%x]\n", > engine->name, > port_seqno(&engine->execlist_port[0]), > port_seqno(&engine->execlist_port[1])); > engine->execlist_port[0].count = 0; > engine->execlist_port[1].count = 0; > - execlists_submit_ports(engine); > + if (!dev_priv->guc.execbuf_client) > + execlists_submit_ports(engine); Not sure what you were intending to do here as this only resets the submission count -- which is not used by guc dequeue. Some merit in the making the code look similar, certainly adds the dbg message but I think it is unrelated to the rest of the patch. -Chris
On 19/04/17 03:27, Chris Wilson wrote: > On Tue, Apr 18, 2017 at 01:23:29PM -0700, Michel Thierry wrote: >> This patch adds per engine reset and recovery (TDR) support when GuC is >> used to submit workloads to GPU. >> >> In the case of i915 directly submission to ELSP, driver manages hang >> detection, recovery and resubmission. With GuC submission these tasks >> are shared between driver and GuC. i915 is still responsible for detecting >> a hang, and when it does it only requests GuC to reset that Engine. GuC >> internally manages acquiring forcewake and idling the engine before actually >> resetting it. >> >> Once the reset is successful, i915 takes over again and handles resubmission. >> The scheduler in i915 knows which requests are pending so after resetting >> a engine, pending workloads/requests are resubmitted again. >> >> v2: s/i915_guc_request_engine_reset/i915_guc_reset_engine/ to match the >> non-guc funtion names. >> >> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com> >> Signed-off-by: Jeff McGee <jeff.mcgee@intel.com> >> Signed-off-by: Michel Thierry <michel.thierry@intel.com> >> --- >> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c >> index 7df278fe492e..6295760098a1 100644 >> --- a/drivers/gpu/drm/i915/intel_lrc.c >> +++ b/drivers/gpu/drm/i915/intel_lrc.c >> @@ -1176,14 +1176,15 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine) >> >> /* After a GPU reset, we may have requests to replay */ >> clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted); >> - if (!i915.enable_guc_submission && !execlists_elsp_idle(engine)) { >> + if (!execlists_elsp_idle(engine)) { >> DRM_DEBUG_DRIVER("Restarting %s from requests [0x%x, 0x%x]\n", >> engine->name, >> port_seqno(&engine->execlist_port[0]), >> port_seqno(&engine->execlist_port[1])); >> engine->execlist_port[0].count = 0; >> engine->execlist_port[1].count = 0; >> - execlists_submit_ports(engine); >> + if (!dev_priv->guc.execbuf_client) >> + execlists_submit_ports(engine); > > Not sure what you were intending to do here as this only resets the > submission count -- which is not used by guc dequeue. Some merit in the > making the code look similar, certainly adds the dbg message but I think > it is unrelated to the rest of the patch. Yes, it only keeps the same debug message (originally added to check it was taking the right path). I can remove if you think it doesn't provide anything useful.
On Wed, Apr 19, 2017 at 04:22:43PM -0700, Michel Thierry wrote: > On 19/04/17 03:27, Chris Wilson wrote: > >On Tue, Apr 18, 2017 at 01:23:29PM -0700, Michel Thierry wrote: > >>This patch adds per engine reset and recovery (TDR) support when GuC is > >>used to submit workloads to GPU. > >> > >>In the case of i915 directly submission to ELSP, driver manages hang > >>detection, recovery and resubmission. With GuC submission these tasks > >>are shared between driver and GuC. i915 is still responsible for detecting > >>a hang, and when it does it only requests GuC to reset that Engine. GuC > >>internally manages acquiring forcewake and idling the engine before actually > >>resetting it. > >> > >>Once the reset is successful, i915 takes over again and handles resubmission. > >>The scheduler in i915 knows which requests are pending so after resetting > >>a engine, pending workloads/requests are resubmitted again. > >> > >>v2: s/i915_guc_request_engine_reset/i915_guc_reset_engine/ to match the > >>non-guc funtion names. > >> > >>Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com> > >>Signed-off-by: Jeff McGee <jeff.mcgee@intel.com> > >>Signed-off-by: Michel Thierry <michel.thierry@intel.com> > >>--- > >>diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c > >>index 7df278fe492e..6295760098a1 100644 > >>--- a/drivers/gpu/drm/i915/intel_lrc.c > >>+++ b/drivers/gpu/drm/i915/intel_lrc.c > >>@@ -1176,14 +1176,15 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine) > >> > >> /* After a GPU reset, we may have requests to replay */ > >> clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted); > >>- if (!i915.enable_guc_submission && !execlists_elsp_idle(engine)) { > >>+ if (!execlists_elsp_idle(engine)) { > >> DRM_DEBUG_DRIVER("Restarting %s from requests [0x%x, 0x%x]\n", > >> engine->name, > >> port_seqno(&engine->execlist_port[0]), > >> port_seqno(&engine->execlist_port[1])); > >> engine->execlist_port[0].count = 0; > >> engine->execlist_port[1].count = 0; > >>- execlists_submit_ports(engine); > >>+ if (!dev_priv->guc.execbuf_client) > >>+ execlists_submit_ports(engine); > > > >Not sure what you were intending to do here as this only resets the > >submission count -- which is not used by guc dequeue. Some merit in the > >making the code look similar, certainly adds the dbg message but I think > >it is unrelated to the rest of the patch. > > Yes, it only keeps the same debug message (originally added to check > it was taking the right path). I can remove if you think it doesn't > provide anything useful. Just a small patch by itself, it is only a distraction to the larger patch. -Chris
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c index 974be1fa77f9..b7e2fa8a0036 100644 --- a/drivers/gpu/drm/i915/i915_drv.c +++ b/drivers/gpu/drm/i915/i915_drv.c @@ -1910,23 +1910,34 @@ int i915_reset_engine(struct intel_engine_cs *engine) */ i915_gem_reset_engine(engine); - /* forcing engine to idle */ - ret = intel_reset_engine_start(engine); - if (ret) { - DRM_ERROR("Failed to disable %s\n", engine->name); - goto error; - } + if (!dev_priv->guc.execbuf_client) { + /* forcing engine to idle */ + ret = intel_reset_engine_start(engine); + if (ret) { + DRM_ERROR("Failed to disable %s\n", engine->name); + goto error; + } - /* finally, reset engine */ - ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine)); - if (ret) { - DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret); + /* finally, reset engine */ + ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine)); + if (ret) { + DRM_ERROR("Failed to reset %s, ret=%d\n", + engine->name, ret); + intel_reset_engine_cancel(engine); + goto error; + } + + /* be sure the request reset bit gets cleared */ intel_reset_engine_cancel(engine); - goto error; - } - /* be sure the request reset bit gets cleared */ - intel_reset_engine_cancel(engine); + } else { + ret = i915_guc_reset_engine(engine); + if (ret) { + DRM_ERROR("GuC failed to reset %s, ret=%d\n", + engine->name, ret); + goto error; + } + } i915_gem_reset_finish_engine(engine); @@ -1935,6 +1946,10 @@ int i915_reset_engine(struct intel_engine_cs *engine) if (ret) goto error; + /* for guc too */ + if (dev_priv->guc.execbuf_client) + i915_guc_submission_reenable_engine(engine); + error->reset_engine_count[engine->id]++; wakeup: diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index 71c34f15be64..5f2345fbff44 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -3029,6 +3029,7 @@ extern int i915_reset_engine(struct intel_engine_cs *engine); extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv); extern int intel_reset_engine_start(struct intel_engine_cs *engine); extern void intel_reset_engine_cancel(struct intel_engine_cs *engine); +extern int i915_guc_reset_engine(struct intel_engine_cs *engine); extern int intel_guc_reset(struct drm_i915_private *dev_priv); extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine); extern void intel_hangcheck_init(struct drm_i915_private *dev_priv); diff --git a/drivers/gpu/drm/i915/i915_guc_submission.c b/drivers/gpu/drm/i915/i915_guc_submission.c index d772718861df..c8067aeab6f4 100644 --- a/drivers/gpu/drm/i915/i915_guc_submission.c +++ b/drivers/gpu/drm/i915/i915_guc_submission.c @@ -1338,6 +1338,25 @@ void i915_guc_submission_disable(struct drm_i915_private *dev_priv) guc->execbuf_client = NULL; } +void i915_guc_submission_reenable_engine(struct intel_engine_cs *engine) +{ + struct drm_i915_private *dev_priv = engine->i915; + struct intel_guc *guc = &dev_priv->guc; + struct i915_guc_client *client = guc->execbuf_client; + const int wqi_size = sizeof(struct guc_wq_item); + struct drm_i915_gem_request *rq; + + GEM_BUG_ON(!client); + intel_guc_sample_forcewake(guc); + + spin_lock_irq(&engine->timeline->lock); + list_for_each_entry(rq, &engine->timeline->requests, link) { + guc_client_update_wq_rsvd(client, wqi_size); + __i915_guc_submit(rq); + } + spin_unlock_irq(&engine->timeline->lock); +} + /** * intel_guc_suspend() - notify GuC entering suspend state * @dev_priv: i915 device private @@ -1389,3 +1408,32 @@ int intel_guc_resume(struct drm_i915_private *dev_priv) return intel_guc_send(guc, data, ARRAY_SIZE(data)); } + +int i915_guc_reset_engine(struct intel_engine_cs *engine) +{ + struct drm_i915_private *dev_priv = engine->i915; + struct intel_guc *guc = &dev_priv->guc; + struct i915_gem_context *ctx; + u32 data[7]; + + if (!i915.enable_guc_submission) + return 0; + + ctx = dev_priv->kernel_context; + + /* + * The affected context report is populated by GuC and is provided + * to the driver using the shared page. We request for it but don't + * use it as scheduler has all of these details. + */ + data[0] = INTEL_GUC_ACTION_REQUEST_ENGINE_RESET; + data[1] = engine->guc_id; + data[2] = INTEL_GUC_RESET_OPTION_REPORT_AFFECTED_CONTEXTS; + data[3] = 0; + data[4] = 0; + data[5] = guc->execbuf_client->stage_id; + /* first page is shared data with GuC */ + data[6] = guc_ggtt_offset(ctx->engine[RCS].state); + + return intel_guc_send(guc, data, ARRAY_SIZE(data)); +} diff --git a/drivers/gpu/drm/i915/intel_guc_fwif.h b/drivers/gpu/drm/i915/intel_guc_fwif.h index e6f8079df94a..081f2cf614e6 100644 --- a/drivers/gpu/drm/i915/intel_guc_fwif.h +++ b/drivers/gpu/drm/i915/intel_guc_fwif.h @@ -505,6 +505,7 @@ union guc_log_control { /* This Action will be programmed in C180 - SOFT_SCRATCH_O_REG */ enum intel_guc_action { INTEL_GUC_ACTION_DEFAULT = 0x0, + INTEL_GUC_ACTION_REQUEST_ENGINE_RESET = 0x3, INTEL_GUC_ACTION_SAMPLE_FORCEWAKE = 0x6, INTEL_GUC_ACTION_ALLOCATE_DOORBELL = 0x10, INTEL_GUC_ACTION_DEALLOCATE_DOORBELL = 0x20, @@ -518,6 +519,11 @@ enum intel_guc_action { INTEL_GUC_ACTION_LIMIT }; +/* Reset engine options */ +enum action_engine_reset_options { + INTEL_GUC_RESET_OPTION_REPORT_AFFECTED_CONTEXTS = 0x10, +}; + /* * The GuC sends its response to a command by overwriting the * command in SS0. The response is distinguishable from a command diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c index 7df278fe492e..6295760098a1 100644 --- a/drivers/gpu/drm/i915/intel_lrc.c +++ b/drivers/gpu/drm/i915/intel_lrc.c @@ -1176,14 +1176,15 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine) /* After a GPU reset, we may have requests to replay */ clear_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted); - if (!i915.enable_guc_submission && !execlists_elsp_idle(engine)) { + if (!execlists_elsp_idle(engine)) { DRM_DEBUG_DRIVER("Restarting %s from requests [0x%x, 0x%x]\n", engine->name, port_seqno(&engine->execlist_port[0]), port_seqno(&engine->execlist_port[1])); engine->execlist_port[0].count = 0; engine->execlist_port[1].count = 0; - execlists_submit_ports(engine); + if (!dev_priv->guc.execbuf_client) + execlists_submit_ports(engine); } return 0; diff --git a/drivers/gpu/drm/i915/intel_uc.h b/drivers/gpu/drm/i915/intel_uc.h index 2f0229da20bb..a348d08ce227 100644 --- a/drivers/gpu/drm/i915/intel_uc.h +++ b/drivers/gpu/drm/i915/intel_uc.h @@ -247,6 +247,7 @@ int i915_guc_wq_reserve(struct drm_i915_gem_request *rq); void i915_guc_wq_unreserve(struct drm_i915_gem_request *request); void i915_guc_submission_disable(struct drm_i915_private *dev_priv); void i915_guc_submission_fini(struct drm_i915_private *dev_priv); +void i915_guc_submission_reenable_engine(struct intel_engine_cs *engine); struct i915_vma *intel_guc_allocate_vma(struct intel_guc *guc, u32 size); /* intel_guc_log.c */ diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c index 120fb440bb8b..778813ac7628 100644 --- a/drivers/gpu/drm/i915/intel_uncore.c +++ b/drivers/gpu/drm/i915/intel_uncore.c @@ -1781,14 +1781,9 @@ bool intel_has_gpu_reset(struct drm_i915_private *dev_priv) return intel_get_gpu_reset(dev_priv) != NULL; } -/* - * When GuC submission is enabled, GuC manages ELSP and can initiate the - * engine reset too. For now, fall back to full GPU reset if it is enabled. - */ bool intel_has_reset_engine(struct drm_i915_private *dev_priv) { return (dev_priv->info.has_reset_engine && - !dev_priv->guc.execbuf_client && i915.reset == 2); }