Message ID | 20170427231300.32841-3-michel.thierry@intel.com (mailing list archive) |
---|---|
State | New, archived |
On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote: > From: Arun Siluvery <arun.siluvery@linux.intel.com> > > This is a preparatory patch which modifies error handler to do per engine > hang recovery. The actual patch which implements this sequence follows > later in the series. The aim is to prepare existing recovery function to > adapt to this new function where applicable (which fails at this point > because core implementation is lacking) and continue recovery using legacy > full gpu reset. > > A helper function is also added to query the availability of engine > reset. > > The error events behaviour that are used to notify user of reset are > adapted to engine reset such that it doesn't break users listening to these > events. In legacy we report an error event, a reset event before resetting > the gpu and a reset done event marking the completion of reset. The same > behaviour is adapted but reset event is only dispatched once even when > multiple engines are hung. Finally once reset is complete we send reset > done event as usual. > > Note that this implementation of engine reset is for i915 directly > submitting to the ELSP, where the driver manages the hang detection, > recovery and resubmission. With GuC submission these tasks are shared > between driver and firmware; i915 will still responsible for detecting a > hang, and when it does it will have to request GuC to reset that Engine and > remind the firmware about the outstanding submissions. This will be > added in different patch. > > v2: rebase, advertise engine reset availability in platform definition, > add note about GuC submission. > v3: s/*engine_reset*/*reset_engine*/. (Chris) > Handle reset as 2 level resets, by first going to engine only and fall > backing to full/chip reset as needed, i.e. reset_engine will need the > struct_mutex. > v4: Pass the engine mask to i915_reset. (Chris) > v5: Rebase, update selftests. > v6: Rebase, prepare for mutex-less reset engine. > v7: Pass reset_engine mask as a function parameter, and iterate over the > engine mask for reset_engine. (Chris) > > Cc: Chris Wilson <chris@chris-wilson.co.uk> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> > Signed-off-by: Ian Lister <ian.lister@intel.com> > Signed-off-by: Tomas Elf <tomas.elf@intel.com> > Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com> > Signed-off-by: Michel Thierry <michel.thierry@intel.com> > --- > drivers/gpu/drm/i915/i915_drv.c | 15 +++++++++++++++ > drivers/gpu/drm/i915/i915_drv.h | 3 +++ > drivers/gpu/drm/i915/i915_irq.c | 33 ++++++++++++++++++++++++++++++--- > drivers/gpu/drm/i915/i915_pci.c | 5 ++++- > drivers/gpu/drm/i915/intel_uncore.c | 11 +++++++++++ > 5 files changed, 63 insertions(+), 4 deletions(-) > > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c > index c7d68e789642..48c8b69d9bde 100644 > --- a/drivers/gpu/drm/i915/i915_drv.c > +++ b/drivers/gpu/drm/i915/i915_drv.c > @@ -1800,6 +1800,8 @@ void i915_reset(struct drm_i915_private *dev_priv) > if (!test_bit(I915_RESET_HANDOFF, &error->flags)) > return; > > + DRM_DEBUG_DRIVER("resetting chip\n"); This is redundant since we have a "Resetting chip" already here. Just kill it. > + > /* Clear any previous failed attempts at recovery. Time to try again. 
*/ > if (!i915_gem_unset_wedged(dev_priv)) > goto wakeup; > @@ -1863,6 +1865,19 @@ void i915_reset(struct drm_i915_private *dev_priv) > goto finish; > } > > +/** > + * i915_reset_engine - reset GPU engine to recover from a hang > + * @engine: engine to reset > + * > + * Reset a specific GPU engine. Useful if a hang is detected. > + * Returns zero on successful reset or otherwise an error code. > + */ > +int i915_reset_engine(struct intel_engine_cs *engine) > +{ > + /* FIXME: replace me with engine reset sequence */ > + return -ENODEV; > +} > + > static int i915_pm_suspend(struct device *kdev) > { > struct pci_dev *pdev = to_pci_dev(kdev); > diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h > index e06af46f5a57..ab7e68626c49 100644 > --- a/drivers/gpu/drm/i915/i915_drv.h > +++ b/drivers/gpu/drm/i915/i915_drv.h > @@ -814,6 +814,7 @@ struct intel_csr { > func(has_ddi); \ > func(has_decoupled_mmio); \ > func(has_dp_mst); \ > + func(has_reset_engine); \ > func(has_fbc); \ > func(has_fpga_dbg); \ > func(has_full_ppgtt); \ > @@ -3019,6 +3020,8 @@ extern void i915_driver_unload(struct drm_device *dev); > extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask); > extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv); > extern void i915_reset(struct drm_i915_private *dev_priv); > +extern int i915_reset_engine(struct intel_engine_cs *engine); > +extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv); > extern int intel_guc_reset(struct drm_i915_private *dev_priv); > extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine); > extern void intel_hangcheck_init(struct drm_i915_private *dev_priv); > diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c > index fd97fe00cd0d..3a59ef1367ec 100644 > --- a/drivers/gpu/drm/i915/i915_irq.c > +++ b/drivers/gpu/drm/i915/i915_irq.c > @@ -2635,11 +2635,13 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg) > /** > * i915_reset_and_wakeup - do process context error handling work > * @dev_priv: i915 device private > + * @engine_mask: engine(s) hung - for reset-engine only. > * > * Fire an error uevent so userspace can see that a hang or error > * was detected. > */ > -static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv) > +static void > +i915_reset_and_wakeup(struct drm_i915_private *dev_priv, u32 engine_mask) > { > struct kobject *kobj = &dev_priv->drm.primary->kdev->kobj; > char *error_event[] = { I915_ERROR_UEVENT "=1", NULL }; > @@ -2648,9 +2650,33 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv) > > kobject_uevent_env(kobj, KOBJ_CHANGE, error_event); > > - DRM_DEBUG_DRIVER("resetting chip\n"); > + /* > + * This event needs to be sent before performing gpu reset. When > + * engine resets are supported we iterate through all engines and > + * reset hung engines individually. To keep the event dispatch > + * mechanism consistent with full gpu reset, this is only sent once > + * even when multiple engines are hung. It is also safe to move this > + * here because when we are in this function, we will definitely > + * perform gpu reset. > + */ > kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event); > > + /* try engine reset first, and continue if fails; look mom, no mutex! 
*/ > + if (intel_has_reset_engine(dev_priv)) { > + struct intel_engine_cs *engine; > + unsigned int tmp; > + > + for_each_engine_masked(engine, dev_priv, engine_mask, tmp) { > + if (i915_reset_engine(engine) == 0) > + engine_mask &= ~intel_engine_flag(engine); > + } > + > + if (engine_mask) > + DRM_WARN("per-engine reset failed, promoting to full gpu reset\n"); > + else > + goto finish; This will look nicer if we did just try per-engine reset and then quitely promote (it's not that quiet as we do get logging) to global. for_each_engine_masked() {} if (!engine_mask) > + } > + > intel_prepare_reset(dev_priv); > > set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags); > @@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv) > kobject_uevent_env(kobj, > KOBJ_CHANGE, reset_done_event); > > +finish: > /* > * Note: The wake_up also serves as a memory barrier so that > * waiters see the updated value of the dev_priv->gpu_error. > @@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private *dev_priv, > &dev_priv->gpu_error.flags)) > goto out; > > - i915_reset_and_wakeup(dev_priv); > + i915_reset_and_wakeup(dev_priv, engine_mask); ? You don't need to wakeup the struct_mutex so we don't need this after per-engine resets. Time to split up i915_reset_and_wakeup(), because we certainly shouldn't be calling intel_finish_reset() without first calling intel_prepare_reset(). Which is right here in my tree... > +/* > + * When GuC submission is enabled, GuC manages ELSP and can initiate the > + * engine reset too. For now, fall back to full GPU reset if it is enabled. > + */ > +bool intel_has_reset_engine(struct drm_i915_private *dev_priv) > +{ > + return (dev_priv->info.has_reset_engine && > + !dev_priv->guc.execbuf_client && > + i915.reset == 2); i915.reset >= 2 -Chris
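For reference, a rough sketch of the restructuring Chris is suggesting: attempt the per-engine resets first and let anything still left in the mask fall through to the legacy full GPU reset. It only uses helpers introduced by this patch; the standalone wrapper name is made up for illustration and is not part of the series.

```c
/*
 * Sketch only (not part of the patch): engines that reset cleanly are
 * dropped from the mask; a non-zero return value means the caller
 * should continue with the legacy intel_prepare_reset()/i915_reset()
 * sequence as before.
 */
static u32 try_reset_engines(struct drm_i915_private *dev_priv,
			     u32 engine_mask)
{
	struct intel_engine_cs *engine;
	unsigned int tmp;

	if (!intel_has_reset_engine(dev_priv))
		return engine_mask;

	for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
		if (i915_reset_engine(engine) == 0)
			engine_mask &= ~intel_engine_flag(engine);

	if (engine_mask)
		DRM_WARN("per-engine reset failed, promoting to full gpu reset\n");

	return engine_mask;
}
```

With that shape, i915_reset_and_wakeup() would only need a single `if (!engine_mask) goto finish;` check instead of carrying the warn/else branch inline.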
On 4/29/2017 7:19 AM, Chris Wilson wrote: > On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote: >> From: Arun Siluvery <arun.siluvery@linux.intel.com> >> >> This is a preparatory patch which modifies error handler to do per engine >> hang recovery. The actual patch which implements this sequence follows >> later in the series. The aim is to prepare existing recovery function to >> adapt to this new function where applicable (which fails at this point >> because core implementation is lacking) and continue recovery using legacy >> full gpu reset. >> >> A helper function is also added to query the availability of engine >> reset. >> >> The error events behaviour that are used to notify user of reset are >> adapted to engine reset such that it doesn't break users listening to these >> events. In legacy we report an error event, a reset event before resetting >> the gpu and a reset done event marking the completion of reset. The same >> behaviour is adapted but reset event is only dispatched once even when >> multiple engines are hung. Finally once reset is complete we send reset >> done event as usual. >> >> Note that this implementation of engine reset is for i915 directly >> submitting to the ELSP, where the driver manages the hang detection, >> recovery and resubmission. With GuC submission these tasks are shared >> between driver and firmware; i915 will still responsible for detecting a >> hang, and when it does it will have to request GuC to reset that Engine and >> remind the firmware about the outstanding submissions. This will be >> added in different patch. >> >> v2: rebase, advertise engine reset availability in platform definition, >> add note about GuC submission. >> v3: s/*engine_reset*/*reset_engine*/. (Chris) >> Handle reset as 2 level resets, by first going to engine only and fall >> backing to full/chip reset as needed, i.e. reset_engine will need the >> struct_mutex. >> v4: Pass the engine mask to i915_reset. (Chris) >> v5: Rebase, update selftests. >> v6: Rebase, prepare for mutex-less reset engine. >> v7: Pass reset_engine mask as a function parameter, and iterate over the >> engine mask for reset_engine. (Chris) >> >> Cc: Chris Wilson <chris@chris-wilson.co.uk> >> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> >> Signed-off-by: Ian Lister <ian.lister@intel.com> >> Signed-off-by: Tomas Elf <tomas.elf@intel.com> >> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com> >> Signed-off-by: Michel Thierry <michel.thierry@intel.com> >> --- >> drivers/gpu/drm/i915/i915_drv.c | 15 +++++++++++++++ >> drivers/gpu/drm/i915/i915_drv.h | 3 +++ >> drivers/gpu/drm/i915/i915_irq.c | 33 ++++++++++++++++++++++++++++++--- >> drivers/gpu/drm/i915/i915_pci.c | 5 ++++- >> drivers/gpu/drm/i915/intel_uncore.c | 11 +++++++++++ >> 5 files changed, 63 insertions(+), 4 deletions(-) >> >> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c >> index c7d68e789642..48c8b69d9bde 100644 >> --- a/drivers/gpu/drm/i915/i915_drv.c >> +++ b/drivers/gpu/drm/i915/i915_drv.c >> @@ -1800,6 +1800,8 @@ void i915_reset(struct drm_i915_private *dev_priv) >> if (!test_bit(I915_RESET_HANDOFF, &error->flags)) >> return; >> >> + DRM_DEBUG_DRIVER("resetting chip\n"); > > This is redundant since we have a "Resetting chip" already here. Just > kill it. > >> + >> /* Clear any previous failed attempts at recovery. Time to try again. 
*/ >> if (!i915_gem_unset_wedged(dev_priv)) >> goto wakeup; >> @@ -1863,6 +1865,19 @@ void i915_reset(struct drm_i915_private *dev_priv) >> goto finish; >> } >> >> +/** >> + * i915_reset_engine - reset GPU engine to recover from a hang >> + * @engine: engine to reset >> + * >> + * Reset a specific GPU engine. Useful if a hang is detected. >> + * Returns zero on successful reset or otherwise an error code. >> + */ >> +int i915_reset_engine(struct intel_engine_cs *engine) >> +{ >> + /* FIXME: replace me with engine reset sequence */ >> + return -ENODEV; >> +} >> + >> static int i915_pm_suspend(struct device *kdev) >> { >> struct pci_dev *pdev = to_pci_dev(kdev); >> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h >> index e06af46f5a57..ab7e68626c49 100644 >> --- a/drivers/gpu/drm/i915/i915_drv.h >> +++ b/drivers/gpu/drm/i915/i915_drv.h >> @@ -814,6 +814,7 @@ struct intel_csr { >> func(has_ddi); \ >> func(has_decoupled_mmio); \ >> func(has_dp_mst); \ >> + func(has_reset_engine); \ >> func(has_fbc); \ >> func(has_fpga_dbg); \ >> func(has_full_ppgtt); \ >> @@ -3019,6 +3020,8 @@ extern void i915_driver_unload(struct drm_device *dev); >> extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask); >> extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv); >> extern void i915_reset(struct drm_i915_private *dev_priv); >> +extern int i915_reset_engine(struct intel_engine_cs *engine); >> +extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv); >> extern int intel_guc_reset(struct drm_i915_private *dev_priv); >> extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine); >> extern void intel_hangcheck_init(struct drm_i915_private *dev_priv); >> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c >> index fd97fe00cd0d..3a59ef1367ec 100644 >> --- a/drivers/gpu/drm/i915/i915_irq.c >> +++ b/drivers/gpu/drm/i915/i915_irq.c >> @@ -2635,11 +2635,13 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg) >> /** >> * i915_reset_and_wakeup - do process context error handling work >> * @dev_priv: i915 device private >> + * @engine_mask: engine(s) hung - for reset-engine only. >> * >> * Fire an error uevent so userspace can see that a hang or error >> * was detected. >> */ >> -static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv) >> +static void >> +i915_reset_and_wakeup(struct drm_i915_private *dev_priv, u32 engine_mask) >> { >> struct kobject *kobj = &dev_priv->drm.primary->kdev->kobj; >> char *error_event[] = { I915_ERROR_UEVENT "=1", NULL }; >> @@ -2648,9 +2650,33 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv) >> >> kobject_uevent_env(kobj, KOBJ_CHANGE, error_event); >> >> - DRM_DEBUG_DRIVER("resetting chip\n"); >> + /* >> + * This event needs to be sent before performing gpu reset. When >> + * engine resets are supported we iterate through all engines and >> + * reset hung engines individually. To keep the event dispatch >> + * mechanism consistent with full gpu reset, this is only sent once >> + * even when multiple engines are hung. It is also safe to move this >> + * here because when we are in this function, we will definitely >> + * perform gpu reset. >> + */ >> kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event); >> >> + /* try engine reset first, and continue if fails; look mom, no mutex! 
*/ >> + if (intel_has_reset_engine(dev_priv)) { >> + struct intel_engine_cs *engine; >> + unsigned int tmp; >> + >> + for_each_engine_masked(engine, dev_priv, engine_mask, tmp) { >> + if (i915_reset_engine(engine) == 0) >> + engine_mask &= ~intel_engine_flag(engine); >> + } >> + >> + if (engine_mask) >> + DRM_WARN("per-engine reset failed, promoting to full gpu reset\n"); >> + else >> + goto finish; > > This will look nicer if we did just try per-engine reset and then > quitely promote (it's not that quiet as we do get logging) to global. > > for_each_engine_masked() {} > if (!engine_mask) > > >> + } >> + >> intel_prepare_reset(dev_priv); >> >> set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags); >> @@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv) >> kobject_uevent_env(kobj, >> KOBJ_CHANGE, reset_done_event); >> >> +finish: >> /* >> * Note: The wake_up also serves as a memory barrier so that >> * waiters see the updated value of the dev_priv->gpu_error. >> @@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private *dev_priv, >> &dev_priv->gpu_error.flags)) >> goto out; >> >> - i915_reset_and_wakeup(dev_priv); >> + i915_reset_and_wakeup(dev_priv, engine_mask); > > ? You don't need to wakeup the struct_mutex so we don't need this after > per-engine resets. Time to split up i915_reset_and_wakeup(), because we > certainly shouldn't be calling intel_finish_reset() without first calling > intel_prepare_reset(). Which is right here in my tree... > Looking at your tree, it wouldn't call finish_reset there either, only these two are called after a successful reset: finish: clear_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags); wake_up_all(&dev_priv->gpu_error.reset_queue); But you're right, we only need to clear the error flag, no need to call wake_up_all. Should I move the per-engine reset to i915_handle_error, and then leave i915_reset_and_wakeup just for full resets? That would also make the promotion from per-engine to global look a bit 'clearer'. Thanks, >> +/* >> + * When GuC submission is enabled, GuC manages ELSP and can initiate the >> + * engine reset too. For now, fall back to full GPU reset if it is enabled. >> + */ >> +bool intel_has_reset_engine(struct drm_i915_private *dev_priv) >> +{ >> + return (dev_priv->info.has_reset_engine && >> + !dev_priv->guc.execbuf_client && >> + i915.reset == 2); > > i915.reset >= 2 > -Chris >
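One possible shape for the split Michel is asking about, assuming the try_reset_engines() sketch above and leaving i915_reset_and_wakeup() to handle full GPU resets only. This is an illustration of the idea, not code from the series; the elided parts are marked with comments.

```c
/*
 * Illustration only: per-engine recovery attempted directly from the
 * error handler; the full-reset worker is entered only for engines
 * that could not be recovered individually.
 */
void i915_handle_error(struct drm_i915_private *dev_priv,
		       u32 engine_mask, const char *fmt, ...)
{
	/* ... existing error capture and gpu_error flag handling ... */

	engine_mask = try_reset_engines(dev_priv, engine_mask);
	if (engine_mask)
		i915_reset_and_wakeup(dev_priv); /* legacy full gpu reset */

	intel_runtime_pm_put(dev_priv);
}
```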
On 5/8/2017 11:31 AM, Michel Thierry wrote: > On 4/29/2017 7:19 AM, Chris Wilson wrote: >> On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote: >>> From: Arun Siluvery <arun.siluvery@linux.intel.com> >>> ... >>> + } >>> + >>> intel_prepare_reset(dev_priv); >>> >>> set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags); >>> @@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct >>> drm_i915_private *dev_priv) >>> kobject_uevent_env(kobj, >>> KOBJ_CHANGE, reset_done_event); >>> >>> +finish: >>> /* >>> * Note: The wake_up also serves as a memory barrier so that >>> * waiters see the updated value of the dev_priv->gpu_error. >>> @@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private >>> *dev_priv, >>> &dev_priv->gpu_error.flags)) >>> goto out; >>> >>> - i915_reset_and_wakeup(dev_priv); >>> + i915_reset_and_wakeup(dev_priv, engine_mask); >> >> ? You don't need to wakeup the struct_mutex so we don't need this after >> per-engine resets. Time to split up i915_reset_and_wakeup(), because we >> certainly shouldn't be calling intel_finish_reset() without first calling >> intel_prepare_reset(). Which is right here in my tree... >> > > Looking at your tree, it wouldn't call finish_reset there either, only > these two are called after a successful reset: > > finish: > clear_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags); > wake_up_all(&dev_priv->gpu_error.reset_queue); > > But you're right, we only need to clear the error flag, no need to call > wake_up_all. > > Should I move the per-engine reset to i915_handle_error, and then leave > i915_reset_and_wakeup just for full resets? > That would also make the promotion from per-engine to global look a bit > 'clearer'. > I just noticed an issue if I don't call wake_up_all. There can be someone else waiting for the reset to complete (i915_mutex_lock_interruptible -> i915_gem_wait_for_error). I915_RESET_BACKOFF has/had 2 roles, stop any other user to grab the struct mutex (which we won't need in reset-engine) and prevent two concurrent reset attempts (which we still want). Time to add a new flag for the later? (I915_RESET_ENGINE_IN_PROGRESS?) Here's an example without calling wake_up_all (10s timeout): [ 126.816054] [drm:i915_reset_engine [i915]] resetting rcs0 ... [ 137.499910] [IGT] gem_ringfill: exiting, ret=0 Compared to the one that does, [ 69.799519] [drm:i915_reset_engine [i915]] resetting rcs0 ... [ 69.801335] [IGT] gem_tdr: exiting, ret=0 Thanks, -Michel
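One way the two roles could be split, assuming a new I915_RESET_ENGINE_IN_PROGRESS bit (the name floated above; it does not exist in the driver at this point) alongside the existing I915_RESET_BACKOFF in gpu_error.flags. A sketch only:

```c
/*
 * Sketch: I915_RESET_BACKOFF keeps its "fence off struct_mutex users
 * during a full reset" role, while the new bit only guarantees a
 * single reset attempt at a time. If the per-engine path never sets
 * I915_RESET_BACKOFF, there is nothing for i915_gem_wait_for_error()
 * to wait on, so no wake_up_all() is needed on completion.
 */
static bool reset_engine_begin(struct drm_i915_private *dev_priv)
{
	return !test_and_set_bit(I915_RESET_ENGINE_IN_PROGRESS,
				 &dev_priv->gpu_error.flags);
}

static void reset_engine_end(struct drm_i915_private *dev_priv)
{
	clear_bit(I915_RESET_ENGINE_IN_PROGRESS,
		  &dev_priv->gpu_error.flags);
	/* no wake_up_all(&dev_priv->gpu_error.reset_queue) needed here */
}
```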
On Fri, May 12, 2017 at 01:55:11PM -0700, Michel Thierry wrote: > On 5/8/2017 11:31 AM, Michel Thierry wrote: > >On 4/29/2017 7:19 AM, Chris Wilson wrote: > >>On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote: > >>>From: Arun Siluvery <arun.siluvery@linux.intel.com> > >>> > ... > >>>+ } > >>>+ > >>> intel_prepare_reset(dev_priv); > >>> > >>> set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags); > >>>@@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct > >>>drm_i915_private *dev_priv) > >>> kobject_uevent_env(kobj, > >>> KOBJ_CHANGE, reset_done_event); > >>> > >>>+finish: > >>> /* > >>> * Note: The wake_up also serves as a memory barrier so that > >>> * waiters see the updated value of the dev_priv->gpu_error. > >>>@@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private > >>>*dev_priv, > >>> &dev_priv->gpu_error.flags)) > >>> goto out; > >>> > >>>- i915_reset_and_wakeup(dev_priv); > >>>+ i915_reset_and_wakeup(dev_priv, engine_mask); > >> > >>? You don't need to wakeup the struct_mutex so we don't need this after > >>per-engine resets. Time to split up i915_reset_and_wakeup(), because we > >>certainly shouldn't be calling intel_finish_reset() without first calling > >>intel_prepare_reset(). Which is right here in my tree... > >> > > > >Looking at your tree, it wouldn't call finish_reset there either, only > >these two are called after a successful reset: > > > >finish: > > clear_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags); > > wake_up_all(&dev_priv->gpu_error.reset_queue); > > > >But you're right, we only need to clear the error flag, no need to call > >wake_up_all. > > > >Should I move the per-engine reset to i915_handle_error, and then leave > >i915_reset_and_wakeup just for full resets? > >That would also make the promotion from per-engine to global look a bit > >'clearer'. > > > > I just noticed an issue if I don't call wake_up_all. There can be > someone else waiting for the reset to complete > (i915_mutex_lock_interruptible -> i915_gem_wait_for_error). > > I915_RESET_BACKOFF has/had 2 roles, stop any other user to grab the > struct mutex (which we won't need in reset-engine) and prevent two > concurrent reset attempts (which we still want). Time to add a new > flag for the later? (I915_RESET_ENGINE_IN_PROGRESS?) Yes, that would be a good idea to avoid dual purposing the bits. Now that we do direct resets along the wait path, we can completely drop the i915_mutex_interruptible(). (No one else should be holding the mutex indefinitely.) I think that's a better approach -- I think we've already moved all the EIO magic aware to the ABI points where we deemed it necessary. -Chris
On 5/12/2017 2:09 PM, Chris Wilson wrote: > On Fri, May 12, 2017 at 01:55:11PM -0700, Michel Thierry wrote: >> On 5/8/2017 11:31 AM, Michel Thierry wrote: >>> On 4/29/2017 7:19 AM, Chris Wilson wrote: >>>> On Thu, Apr 27, 2017 at 04:12:42PM -0700, Michel Thierry wrote: >>>>> From: Arun Siluvery <arun.siluvery@linux.intel.com> >>>>> >> ... >>>>> + } >>>>> + >>>>> intel_prepare_reset(dev_priv); >>>>> >>>>> set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags); >>>>> @@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct >>>>> drm_i915_private *dev_priv) >>>>> kobject_uevent_env(kobj, >>>>> KOBJ_CHANGE, reset_done_event); >>>>> >>>>> +finish: >>>>> /* >>>>> * Note: The wake_up also serves as a memory barrier so that >>>>> * waiters see the updated value of the dev_priv->gpu_error. >>>>> @@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private >>>>> *dev_priv, >>>>> &dev_priv->gpu_error.flags)) >>>>> goto out; >>>>> >>>>> - i915_reset_and_wakeup(dev_priv); >>>>> + i915_reset_and_wakeup(dev_priv, engine_mask); >>>> >>>> ? You don't need to wakeup the struct_mutex so we don't need this after >>>> per-engine resets. Time to split up i915_reset_and_wakeup(), because we >>>> certainly shouldn't be calling intel_finish_reset() without first calling >>>> intel_prepare_reset(). Which is right here in my tree... >>>> >>> >>> Looking at your tree, it wouldn't call finish_reset there either, only >>> these two are called after a successful reset: >>> >>> finish: >>> clear_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags); >>> wake_up_all(&dev_priv->gpu_error.reset_queue); >>> >>> But you're right, we only need to clear the error flag, no need to call >>> wake_up_all. >>> >>> Should I move the per-engine reset to i915_handle_error, and then leave >>> i915_reset_and_wakeup just for full resets? >>> That would also make the promotion from per-engine to global look a bit >>> 'clearer'. >>> >> >> I just noticed an issue if I don't call wake_up_all. There can be >> someone else waiting for the reset to complete >> (i915_mutex_lock_interruptible -> i915_gem_wait_for_error). >> >> I915_RESET_BACKOFF has/had 2 roles, stop any other user to grab the >> struct mutex (which we won't need in reset-engine) and prevent two >> concurrent reset attempts (which we still want). Time to add a new >> flag for the later? (I915_RESET_ENGINE_IN_PROGRESS?) > > Yes, that would be a good idea to avoid dual purposing the bits. Now > that we do direct resets along the wait path, we can completely drop the > i915_mutex_interruptible(). (No one else should be holding the mutex > indefinitely.) I think that's a better approach -- I think we've already > moved all the EIO magic aware to the ABI points where we deemed it > necessary. And it seems to work ok with the new flag and no wake_up. I'll run more tests. Thanks
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c index c7d68e789642..48c8b69d9bde 100644 --- a/drivers/gpu/drm/i915/i915_drv.c +++ b/drivers/gpu/drm/i915/i915_drv.c @@ -1800,6 +1800,8 @@ void i915_reset(struct drm_i915_private *dev_priv) if (!test_bit(I915_RESET_HANDOFF, &error->flags)) return; + DRM_DEBUG_DRIVER("resetting chip\n"); + /* Clear any previous failed attempts at recovery. Time to try again. */ if (!i915_gem_unset_wedged(dev_priv)) goto wakeup; @@ -1863,6 +1865,19 @@ void i915_reset(struct drm_i915_private *dev_priv) goto finish; } +/** + * i915_reset_engine - reset GPU engine to recover from a hang + * @engine: engine to reset + * + * Reset a specific GPU engine. Useful if a hang is detected. + * Returns zero on successful reset or otherwise an error code. + */ +int i915_reset_engine(struct intel_engine_cs *engine) +{ + /* FIXME: replace me with engine reset sequence */ + return -ENODEV; +} + static int i915_pm_suspend(struct device *kdev) { struct pci_dev *pdev = to_pci_dev(kdev); diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index e06af46f5a57..ab7e68626c49 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -814,6 +814,7 @@ struct intel_csr { func(has_ddi); \ func(has_decoupled_mmio); \ func(has_dp_mst); \ + func(has_reset_engine); \ func(has_fbc); \ func(has_fpga_dbg); \ func(has_full_ppgtt); \ @@ -3019,6 +3020,8 @@ extern void i915_driver_unload(struct drm_device *dev); extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask); extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv); extern void i915_reset(struct drm_i915_private *dev_priv); +extern int i915_reset_engine(struct intel_engine_cs *engine); +extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv); extern int intel_guc_reset(struct drm_i915_private *dev_priv); extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine); extern void intel_hangcheck_init(struct drm_i915_private *dev_priv); diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c index fd97fe00cd0d..3a59ef1367ec 100644 --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -2635,11 +2635,13 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg) /** * i915_reset_and_wakeup - do process context error handling work * @dev_priv: i915 device private + * @engine_mask: engine(s) hung - for reset-engine only. * * Fire an error uevent so userspace can see that a hang or error * was detected. */ -static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv) +static void +i915_reset_and_wakeup(struct drm_i915_private *dev_priv, u32 engine_mask) { struct kobject *kobj = &dev_priv->drm.primary->kdev->kobj; char *error_event[] = { I915_ERROR_UEVENT "=1", NULL }; @@ -2648,9 +2650,33 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv) kobject_uevent_env(kobj, KOBJ_CHANGE, error_event); - DRM_DEBUG_DRIVER("resetting chip\n"); + /* + * This event needs to be sent before performing gpu reset. When + * engine resets are supported we iterate through all engines and + * reset hung engines individually. To keep the event dispatch + * mechanism consistent with full gpu reset, this is only sent once + * even when multiple engines are hung. It is also safe to move this + * here because when we are in this function, we will definitely + * perform gpu reset. 
+ */ kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event); + /* try engine reset first, and continue if fails; look mom, no mutex! */ + if (intel_has_reset_engine(dev_priv)) { + struct intel_engine_cs *engine; + unsigned int tmp; + + for_each_engine_masked(engine, dev_priv, engine_mask, tmp) { + if (i915_reset_engine(engine) == 0) + engine_mask &= ~intel_engine_flag(engine); + } + + if (engine_mask) + DRM_WARN("per-engine reset failed, promoting to full gpu reset\n"); + else + goto finish; + } + intel_prepare_reset(dev_priv); set_bit(I915_RESET_HANDOFF, &dev_priv->gpu_error.flags); @@ -2680,6 +2706,7 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv) kobject_uevent_env(kobj, KOBJ_CHANGE, reset_done_event); +finish: /* * Note: The wake_up also serves as a memory barrier so that * waiters see the updated value of the dev_priv->gpu_error. @@ -2781,7 +2808,7 @@ void i915_handle_error(struct drm_i915_private *dev_priv, &dev_priv->gpu_error.flags)) goto out; - i915_reset_and_wakeup(dev_priv); + i915_reset_and_wakeup(dev_priv, engine_mask); out: intel_runtime_pm_put(dev_priv); diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c index f87b0c4e564d..d5002b55cbd8 100644 --- a/drivers/gpu/drm/i915/i915_pci.c +++ b/drivers/gpu/drm/i915/i915_pci.c @@ -313,7 +313,8 @@ static const struct intel_device_info intel_haswell_info = { BDW_COLORS, \ .has_logical_ring_contexts = 1, \ .has_full_48bit_ppgtt = 1, \ - .has_64bit_reloc = 1 + .has_64bit_reloc = 1, \ + .has_reset_engine = 1 static const struct intel_device_info intel_broadwell_info = { BDW_FEATURES, @@ -345,6 +346,7 @@ static const struct intel_device_info intel_cherryview_info = { .has_gmch_display = 1, .has_aliasing_ppgtt = 1, .has_full_ppgtt = 1, + .has_reset_engine = 1, .display_mmio_offset = VLV_DISPLAY_BASE, GEN_CHV_PIPEOFFSETS, CURSOR_OFFSETS, @@ -394,6 +396,7 @@ static const struct intel_device_info intel_skylake_gt3_info = { .has_aliasing_ppgtt = 1, \ .has_full_ppgtt = 1, \ .has_full_48bit_ppgtt = 1, \ + .has_reset_engine = 1, \ GEN_DEFAULT_PIPEOFFSETS, \ IVB_CURSOR_OFFSETS, \ BDW_COLORS diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c index 07a722f74fa1..ab5bdd110ac3 100644 --- a/drivers/gpu/drm/i915/intel_uncore.c +++ b/drivers/gpu/drm/i915/intel_uncore.c @@ -1776,6 +1776,17 @@ bool intel_has_gpu_reset(struct drm_i915_private *dev_priv) return intel_get_gpu_reset(dev_priv) != NULL; } +/* + * When GuC submission is enabled, GuC manages ELSP and can initiate the + * engine reset too. For now, fall back to full GPU reset if it is enabled. + */ +bool intel_has_reset_engine(struct drm_i915_private *dev_priv) +{ + return (dev_priv->info.has_reset_engine && + !dev_priv->guc.execbuf_client && + i915.reset == 2); +} + int intel_guc_reset(struct drm_i915_private *dev_priv) { int ret;
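For completeness, intel_has_reset_engine() with Chris's `i915.reset >= 2` comment applied would look like this; a sketch of the expected respin, not a new version of the patch.

```c
/*
 * Fall back to full GPU reset while GuC submission owns the ELSP, and
 * only advertise engine reset when the modparam asks for it (any value
 * of 2 or above, per the review comment).
 */
bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
{
	return (dev_priv->info.has_reset_engine &&
		!dev_priv->guc.execbuf_client &&
		i915.reset >= 2);
}
```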