Message ID | 20170418202335.35232-5-michel.thierry@intel.com (mailing list archive) |
---|---|
State | New, archived |
On Tue, Apr 18, 2017 at 01:23:19PM -0700, Michel Thierry wrote:
> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>
> This is a preparatory patch which modifies error handler to do per engine
> hang recovery. The actual patch which implements this sequence follows
> later in the series. The aim is to prepare existing recovery function to
> adapt to this new function where applicable (which fails at this point
> because core implementation is lacking) and continue recovery using legacy
> full gpu reset.
>
> A helper function is also added to query the availability of engine
> reset.
>
> The error events behaviour that are used to notify user of reset are
> adapted to engine reset such that it doesn't break users listening to these
> events. In legacy we report an error event, a reset event before resetting
> the gpu and a reset done event marking the completion of reset. The same
> behaviour is adapted but reset event is only dispatched once even when
> multiple engines are hung. Finally once reset is complete we send reset
> done event as usual.
>
> Note that this implementation of engine reset is for i915 directly
> submitting to the ELSP, where the driver manages the hang detection,
> recovery and resubmission. With GuC submission these tasks are shared
> between driver and firmware; i915 will still be responsible for detecting a
> hang, and when it does it will have to request GuC to reset that Engine and
> remind the firmware about the outstanding submissions. This will be
> added in a different patch.
>
> v2: rebase, advertise engine reset availability in platform definition,
>     add note about GuC submission.
> v3: s/*engine_reset*/*reset_engine*/. (Chris)
>     Handle reset as 2 level resets, by first going to engine only and falling
>     back to full/chip reset as needed, i.e. reset_engine will need the
>     struct_mutex.
> v4: Pass the engine mask to i915_reset. (Chris)
> v5: Rebase, update selftests.
> v6: Rebase, prepare for mutex-less reset engine.

I'm not sure if there is any trace of the original patch left. Certainly
this is the first version that has come close to making me happy and looks
like it might actually work. :)

> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index e06af46f5a57..7bc5f552add7 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -814,6 +814,7 @@ struct intel_csr {
>  	func(has_ddi); \
>  	func(has_decoupled_mmio); \
>  	func(has_dp_mst); \
> +	func(has_reset_engine); \
>  	func(has_fbc); \
>  	func(has_fpga_dbg); \
>  	func(has_full_ppgtt); \
> @@ -1616,6 +1617,9 @@ struct i915_gpu_error {
>  #define I915_RESET_HANDOFF	1
>  #define I915_WEDGED		(BITS_PER_LONG - 1)
>
> +	/* if available, engine-specific reset is tried before full gpu reset */
> +	u32 reset_engine_mask;

I want to convince myself that storing this here is sensible. My
expectation was that this would be passed along as a function parameter,
so I'll need to go back and see why that doesn't work.

>  	/**
>  	 * Waitqueue to signal when a hang is detected. Used to for waiters
>  	 * to release the struct_mutex for the reset to procede.
> @@ -3019,6 +3023,8 @@ extern void i915_driver_unload(struct drm_device *dev);
>  extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
>  extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
>  extern void i915_reset(struct drm_i915_private *dev_priv);
> +extern int i915_reset_engine(struct intel_engine_cs *engine);
> +extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
>  extern int intel_guc_reset(struct drm_i915_private *dev_priv);
>  extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
>  extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index fd97fe00cd0d..ab1d77ab0977 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2645,12 +2645,33 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
>  	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
>  	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
>  	char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
> +	u32 engine_mask = dev_priv->gpu_error.reset_engine_mask;
>
>  	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
>
> -	DRM_DEBUG_DRIVER("resetting chip\n");
> +	/*
> +	 * This event needs to be sent before performing gpu reset. When
> +	 * engine resets are supported we iterate through all engines and
> +	 * reset hung engines individually. To keep the event dispatch
> +	 * mechanism consistent with full gpu reset, this is only sent once
> +	 * even when multiple engines are hung. It is also safe to move this
> +	 * here because when we are in this function, we will definitely
> +	 * perform gpu reset.
> +	 */
>  	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
>
> +	/* try engine reset first, and continue if fails; look mom, no mutex! */
> +	if (intel_has_reset_engine(dev_priv) && !(engine_mask & (engine_mask - 1))) {

if (has_reset_engine() && is_power_of_2(engine_mask)) {
	engine = dev_priv->engine[ilog2(engine_mask)];

Not worth iterating over the engine_mask?

for_each_bit() {
	if (i915_reset_engine(engine) == 0)
		dev_priv->gpu_error.reset_engine_mask &= ~intel_engine_flag()
if (reset_engine_mask)
	DRM_WARN("per-engine reset failed, promoting to full gpu reset\n");

> +		struct intel_engine_cs *engine =
> +			dev_priv->engine[intel_engineid_from_flag(engine_mask)];
> +
> +		if (i915_reset_engine(engine) == 0)
> +			goto finish;
> +
> +		DRM_ERROR("reset %s failed, promoting to full gpu reset\n",
> +			  engine->name);

Justify an *ERROR*. Is it a catastrophe? Do we have a recovery
mechanism? A DRM_WARN is appropriate as something went wrong at the
hw/driver level, but it should not be the end of the world - unlike the
global reset itself failing.

> +	}
> +
>  	intel_prepare_reset(dev_priv);

I am much happier now that we circumvent the big hammer global/display
reset (and so this series can now remain blissfully ignorant of the
display reset regression ;)
-Chris
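For readers following the review, the inline pseudocode above is shorthand. A compilable sketch of the two alternatives Chris suggests (handle the single-engine case with is_power_of_2()/ilog2(), or walk every hung engine with for_each_set_bit()) could look like the following. This is only an illustration built from the helper names used in this thread, not code from the series; it assumes the linux/log2.h and linux/bitops.h helpers and the finish label added by the patch.

	/* Single hung engine: is_power_of_2()/ilog2() instead of open-coding the check */
	if (intel_has_reset_engine(dev_priv) && is_power_of_2(engine_mask)) {
		struct intel_engine_cs *engine =
			dev_priv->engine[ilog2(engine_mask)];

		if (i915_reset_engine(engine) == 0)
			goto finish;

		DRM_WARN("reset %s failed, promoting to full gpu reset\n",
			 engine->name);
	}

	/* Or: reset every hung engine, and only fall back if any of them failed */
	if (intel_has_reset_engine(dev_priv)) {
		unsigned long mask = engine_mask;
		unsigned int id;

		for_each_set_bit(id, &mask, I915_NUM_ENGINES)
			if (i915_reset_engine(dev_priv->engine[id]) == 0)
				engine_mask &= ~BIT(id);

		if (!engine_mask)
			goto finish;

		DRM_WARN("per-engine reset failed, promoting to full gpu reset\n");
	}

As Chris notes in his follow-up further down the thread, the loop form is not quite right as written, since all hung engines would ideally be quiesced before any of them is reset.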
On 18/04/17 14:40, Chris Wilson wrote:
> On Tue, Apr 18, 2017 at 01:23:19PM -0700, Michel Thierry wrote:
>> From: Arun Siluvery <arun.siluvery@linux.intel.com>
>>
>> This is a preparatory patch which modifies error handler to do per engine
>> hang recovery. The actual patch which implements this sequence follows
>> later in the series. The aim is to prepare existing recovery function to
>> adapt to this new function where applicable (which fails at this point
>> because core implementation is lacking) and continue recovery using legacy
>> full gpu reset.
>>
>> A helper function is also added to query the availability of engine
>> reset.
>>
>> The error events behaviour that are used to notify user of reset are
>> adapted to engine reset such that it doesn't break users listening to these
>> events. In legacy we report an error event, a reset event before resetting
>> the gpu and a reset done event marking the completion of reset. The same
>> behaviour is adapted but reset event is only dispatched once even when
>> multiple engines are hung. Finally once reset is complete we send reset
>> done event as usual.
>>
>> Note that this implementation of engine reset is for i915 directly
>> submitting to the ELSP, where the driver manages the hang detection,
>> recovery and resubmission. With GuC submission these tasks are shared
>> between driver and firmware; i915 will still be responsible for detecting a
>> hang, and when it does it will have to request GuC to reset that Engine and
>> remind the firmware about the outstanding submissions. This will be
>> added in a different patch.
>>
>> v2: rebase, advertise engine reset availability in platform definition,
>>     add note about GuC submission.
>> v3: s/*engine_reset*/*reset_engine*/. (Chris)
>>     Handle reset as 2 level resets, by first going to engine only and falling
>>     back to full/chip reset as needed, i.e. reset_engine will need the
>>     struct_mutex.
>> v4: Pass the engine mask to i915_reset. (Chris)
>> v5: Rebase, update selftests.
>> v6: Rebase, prepare for mutex-less reset engine.
>
> I'm not sure if there is any trace of the original patch left. Certainly
> this is the first version that has come close to making me happy and looks
> like it might actually work. :)
>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
>> index e06af46f5a57..7bc5f552add7 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.h
>> +++ b/drivers/gpu/drm/i915/i915_drv.h
>> @@ -814,6 +814,7 @@ struct intel_csr {
>>  	func(has_ddi); \
>>  	func(has_decoupled_mmio); \
>>  	func(has_dp_mst); \
>> +	func(has_reset_engine); \
>>  	func(has_fbc); \
>>  	func(has_fpga_dbg); \
>>  	func(has_full_ppgtt); \
>> @@ -1616,6 +1617,9 @@ struct i915_gpu_error {
>>  #define I915_RESET_HANDOFF	1
>>  #define I915_WEDGED		(BITS_PER_LONG - 1)
>>
>> +	/* if available, engine-specific reset is tried before full gpu reset */
>> +	u32 reset_engine_mask;
>
> I want to convince myself that storing this here is sensible. My
> expectation was that this would be passed along as a function parameter,
> so I'll need to go back and see why that doesn't work.
>

This is a left-over from the previous version; the engine mask can be
passed as a parameter to i915_reset_and_wakeup.

>>  	/**
>>  	 * Waitqueue to signal when a hang is detected. Used to for waiters
>>  	 * to release the struct_mutex for the reset to procede.
>> @@ -3019,6 +3023,8 @@ extern void i915_driver_unload(struct drm_device *dev);
>>  extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
>>  extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
>>  extern void i915_reset(struct drm_i915_private *dev_priv);
>> +extern int i915_reset_engine(struct intel_engine_cs *engine);
>> +extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
>>  extern int intel_guc_reset(struct drm_i915_private *dev_priv);
>>  extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
>>  extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
>> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
>> index fd97fe00cd0d..ab1d77ab0977 100644
>> --- a/drivers/gpu/drm/i915/i915_irq.c
>> +++ b/drivers/gpu/drm/i915/i915_irq.c
>> @@ -2645,12 +2645,33 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
>>  	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
>>  	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
>>  	char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
>> +	u32 engine_mask = dev_priv->gpu_error.reset_engine_mask;
>>
>>  	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
>>
>> -	DRM_DEBUG_DRIVER("resetting chip\n");
>> +	/*
>> +	 * This event needs to be sent before performing gpu reset. When
>> +	 * engine resets are supported we iterate through all engines and
>> +	 * reset hung engines individually. To keep the event dispatch
>> +	 * mechanism consistent with full gpu reset, this is only sent once
>> +	 * even when multiple engines are hung. It is also safe to move this
>> +	 * here because when we are in this function, we will definitely
>> +	 * perform gpu reset.
>> +	 */
>>  	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
>>
>> +	/* try engine reset first, and continue if fails; look mom, no mutex! */
>> +	if (intel_has_reset_engine(dev_priv) && !(engine_mask & (engine_mask - 1))) {
>
> if (has_reset_engine() && is_power_of_2(engine_mask)) {
> 	engine = dev_priv->engine[ilog2(engine_mask)];
>
> Not worth iterating over the engine_mask?
>
> for_each_bit() {
> 	if (i915_reset_engine(engine) == 0)
> 		dev_priv->gpu_error.reset_engine_mask &= ~intel_engine_flag()
> if (reset_engine_mask)
> 	DRM_WARN("per-engine reset failed, promoting to full gpu reset\n");
>

Time-wise, I'm not so sure about the benefits of 2 reset_engines vs
full gpu reset. But it can be done... and thanks for the ilog2 tip.

>> +		struct intel_engine_cs *engine =
>> +			dev_priv->engine[intel_engineid_from_flag(engine_mask)];
>> +
>> +		if (i915_reset_engine(engine) == 0)
>> +			goto finish;
>> +
>> +		DRM_ERROR("reset %s failed, promoting to full gpu reset\n",
>> +			  engine->name);
>
> Justify an *ERROR*. Is it a catastrophe? Do we have a recovery
> mechanism? A DRM_WARN is appropriate as something went wrong at the
> hw/driver level, but it should not be the end of the world - unlike the
> global reset itself failing.
>

I'll change it to DRM_WARN. I haven't been able to reproduce it (only
when I change the code on purpose, or with the live test -
igt_render_engine_reset_fallback).

>> +	}
>> +
>>  	intel_prepare_reset(dev_priv);
>
> I am much happier now that we circumvent the big hammer global/display
> reset (and so this series can now remain blissfully ignorant of the
> display reset regression ;)
> -Chris
>
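The "pass it as a parameter" alternative Michel mentions would essentially be a signature change; a sketch of what that rework might look like (an assumption for illustration, not code from any posted version):

	/* i915_irq.c -- hypothetical rework, engine_mask handed down directly */
	static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv,
					  u32 engine_mask);

	/*
	 * ...and i915_handle_error() would then call it without touching
	 * dev_priv->gpu_error.reset_engine_mask at all:
	 */
	i915_reset_and_wakeup(dev_priv, engine_mask);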
On Tue, Apr 18, 2017 at 03:01:02PM -0700, Michel Thierry wrote:
>
>
> On 18/04/17 14:40, Chris Wilson wrote:
> >> +	/* try engine reset first, and continue if fails; look mom, no mutex! */
> >> +	if (intel_has_reset_engine(dev_priv) && !(engine_mask & (engine_mask - 1))) {
> >
> > if (has_reset_engine() && is_power_of_2(engine_mask)) {
> > 	engine = dev_priv->engine[ilog2(engine_mask)];
> >
> > Not worth iterating over the engine_mask?
> >
> > for_each_bit() {
> > 	if (i915_reset_engine(engine) == 0)
> > 		dev_priv->gpu_error.reset_engine_mask &= ~intel_engine_flag()
> > if (reset_engine_mask)
> > 	DRM_WARN("per-engine reset failed, promoting to full gpu reset\n");
> >
>
> Time-wise, I'm not so sure about the benefits of 2 reset_engines vs
> full gpu reset. But it can be done... and thanks for the ilog2 tip.

Hmm, looping isn't quite right. You would need to pass down the
engine_mask so all engines were paused (in prepare), then reset, then
restarted atomically. Will think a bit more about that.

The immediate benefit is avoiding the struct_mutex + display futzing
more often.
-Chris
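The ordering Chris describes (quiesce every hung engine, reset them together, then restart) could be shaped roughly as below. This is a speculative sketch of the idea only: engine_prepare_reset() and engine_finish_reset() are hypothetical helpers used purely to illustrate the split, while intel_gpu_reset() is the existing entry point that already accepts an engine mask.

	struct intel_engine_cs *engine;
	unsigned int tmp;

	/* 1) pause/quiesce all hung engines first */
	for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
		engine_prepare_reset(engine);		/* hypothetical */

	/* 2) reset them in one go; promote to full reset on failure */
	if (intel_gpu_reset(dev_priv, engine_mask))
		goto full_reset;			/* label not shown */

	/* 3) restart/resubmit on each engine */
	for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
		engine_finish_reset(engine);		/* hypothetical */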
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index cc7393e65e99..e03d0643dbd6 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1800,6 +1800,8 @@ void i915_reset(struct drm_i915_private *dev_priv)
 	if (!test_bit(I915_RESET_HANDOFF, &error->flags))
 		return;
 
+	DRM_DEBUG_DRIVER("resetting chip\n");
+
 	/* Clear any previous failed attempts at recovery. Time to try again. */
 	if (!i915_gem_unset_wedged(dev_priv))
 		goto wakeup;
@@ -1863,6 +1865,19 @@ void i915_reset(struct drm_i915_private *dev_priv)
 	goto finish;
 }
 
+/**
+ * i915_reset_engine - reset GPU engine to recover from a hang
+ * @engine: engine to reset
+ *
+ * Reset a specific GPU engine. Useful if a hang is detected.
+ * Returns zero on successful reset or otherwise an error code.
+ */
+int i915_reset_engine(struct intel_engine_cs *engine)
+{
+	/* FIXME: replace me with engine reset sequence */
+	return -ENODEV;
+}
+
 static int i915_pm_suspend(struct device *kdev)
 {
 	struct pci_dev *pdev = to_pci_dev(kdev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index e06af46f5a57..7bc5f552add7 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -814,6 +814,7 @@ struct intel_csr {
 	func(has_ddi); \
 	func(has_decoupled_mmio); \
 	func(has_dp_mst); \
+	func(has_reset_engine); \
 	func(has_fbc); \
 	func(has_fpga_dbg); \
 	func(has_full_ppgtt); \
@@ -1616,6 +1617,9 @@ struct i915_gpu_error {
 #define I915_RESET_HANDOFF	1
 #define I915_WEDGED		(BITS_PER_LONG - 1)
 
+	/* if available, engine-specific reset is tried before full gpu reset */
+	u32 reset_engine_mask;
+
 	/**
 	 * Waitqueue to signal when a hang is detected. Used to for waiters
 	 * to release the struct_mutex for the reset to procede.
@@ -3019,6 +3023,8 @@ extern void i915_driver_unload(struct drm_device *dev);
 extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
 extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
 extern void i915_reset(struct drm_i915_private *dev_priv);
+extern int i915_reset_engine(struct intel_engine_cs *engine);
+extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
 extern int intel_guc_reset(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index fd97fe00cd0d..ab1d77ab0977 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2645,12 +2645,33 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
 	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
 	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
 	char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
+	u32 engine_mask = dev_priv->gpu_error.reset_engine_mask;
 
 	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
 
-	DRM_DEBUG_DRIVER("resetting chip\n");
+	/*
+	 * This event needs to be sent before performing gpu reset. When
+	 * engine resets are supported we iterate through all engines and
+	 * reset hung engines individually. To keep the event dispatch
+	 * mechanism consistent with full gpu reset, this is only sent once
+	 * even when multiple engines are hung. It is also safe to move this
+	 * here because when we are in this function, we will definitely
+	 * perform gpu reset.
+	 */
 	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
 
+	/* try engine reset first, and continue if fails; look mom, no mutex! */
+	if (intel_has_reset_engine(dev_priv) && !(engine_mask & (engine_mask - 1))) {
+		struct intel_engine_cs *engine =
+			dev_priv->engine[intel_engineid_from_flag(engine_mask)];
+
+		if (i915_reset_engine(engine) == 0)
+			goto finish;
+
+		DRM_ERROR("reset %s failed, promoting to full gpu reset\n",
+			  engine->name);
+	}
+
 	intel_prepare_reset(dev_priv);
 
 	set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags);
@@ -2680,6 +2701,7 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
 
 	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_done_event);
 
+finish:
 	/*
 	 * Note: The wake_up also serves as a memory barrier so that
 	 * waiters see the updated value of the dev_priv->gpu_error.
@@ -2781,6 +2803,12 @@ void i915_handle_error(struct drm_i915_private *dev_priv,
 			       &dev_priv->gpu_error.flags))
 		goto out;
 
+	/*
+	 * Save which engines need reset; if engine support is available,
+	 * we can just reset the hung engines.
+	 */
+	dev_priv->gpu_error.reset_engine_mask = engine_mask;
+
 	i915_reset_and_wakeup(dev_priv);
 
 out:
diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index f87b0c4e564d..d5002b55cbd8 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -313,7 +313,8 @@ static const struct intel_device_info intel_haswell_info = {
 	BDW_COLORS, \
 	.has_logical_ring_contexts = 1, \
 	.has_full_48bit_ppgtt = 1, \
-	.has_64bit_reloc = 1
+	.has_64bit_reloc = 1, \
+	.has_reset_engine = 1
 
 static const struct intel_device_info intel_broadwell_info = {
 	BDW_FEATURES,
@@ -345,6 +346,7 @@ static const struct intel_device_info intel_cherryview_info = {
 	.has_gmch_display = 1,
 	.has_aliasing_ppgtt = 1,
 	.has_full_ppgtt = 1,
+	.has_reset_engine = 1,
 	.display_mmio_offset = VLV_DISPLAY_BASE,
 	GEN_CHV_PIPEOFFSETS,
 	CURSOR_OFFSETS,
@@ -394,6 +396,7 @@ static const struct intel_device_info intel_skylake_gt3_info = {
 	.has_aliasing_ppgtt = 1, \
 	.has_full_ppgtt = 1, \
 	.has_full_48bit_ppgtt = 1, \
+	.has_reset_engine = 1, \
 	GEN_DEFAULT_PIPEOFFSETS, \
 	IVB_CURSOR_OFFSETS, \
 	BDW_COLORS
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 00d36aa4e26d..6e5577dc1f7a 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -442,6 +442,22 @@ intel_engine_flag(const struct intel_engine_cs *engine)
 	return BIT(engine->id);
 }
 
+/* works only for engine_mask with 1 engine bit set */
+static inline unsigned int
+intel_engineid_from_flag(unsigned engine_mask)
+{
+	unsigned int engine_id = 0;
+
+	GEM_BUG_ON(engine_mask & (engine_mask - 1));
+
+	while (engine_mask >>= 1)
+		engine_id++;
+
+	GEM_BUG_ON(engine_id >= I915_NUM_ENGINES);
+
+	return engine_id;
+}
+
 static inline u32
 intel_read_status_page(struct intel_engine_cs *engine, int reg)
 {
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 07a722f74fa1..ab5bdd110ac3 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1776,6 +1776,17 @@ bool intel_has_gpu_reset(struct drm_i915_private *dev_priv)
 	return intel_get_gpu_reset(dev_priv) != NULL;
 }
 
+/*
+ * When GuC submission is enabled, GuC manages ELSP and can initiate the
+ * engine reset too. For now, fall back to full GPU reset if it is enabled.
+ */
+bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
+{
+	return (dev_priv->info.has_reset_engine &&
+		!dev_priv->guc.execbuf_client &&
+		i915.reset == 2);
+}
+
 int intel_guc_reset(struct drm_i915_private *dev_priv)
 {
 	int ret;
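Two notes on the diff above. First, i915_reset_engine() is deliberately a stub here ("FIXME: replace me with engine reset sequence"); the commit message says the real sequence lands later in the series. Purely as an illustration of the shape it could take, and using only functions already declared in this patch, a sketch might be:

	int i915_reset_engine(struct intel_engine_cs *engine)
	{
		struct drm_i915_private *dev_priv = engine->i915;
		int ret;

		/* sketch only -- recovery and resubmission steps come later */
		ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
		if (ret) {
			DRM_DEBUG_DRIVER("%s reset request failed\n", engine->name);
			return ret;
		}

		intel_engine_init_hangcheck(engine);
		return 0;
	}

Second, intel_has_reset_engine() only returns true when the platform advertises has_reset_engine, GuC submission is disabled and the i915.reset module parameter is set to 2, so the per-engine path is strictly opt-in at this point in the series.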