Message ID | 20170522174641.25354-4-michel.thierry@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Quoting Michel Thierry (2017-05-22 18:46:24) > + /* try engine reset first, and continue if fails */ /* Please use sentences when convenient. It looks much neater that way. */ > + if (intel_has_reset_engine(dev_priv)) { > + struct intel_engine_cs *engine; > + unsigned int tmp; > + > + /* protect against concurrent reset attempts */ > + if (test_and_set_bit(I915_RESET_ENGINE_IN_PROGRESS, > + &dev_priv->gpu_error.flags)) > + goto out; > + > + for_each_engine_masked(engine, dev_priv, engine_mask, tmp) { > + if (i915_reset_engine(engine) == 0) > + engine_mask &= ~intel_engine_flag(engine); > + } > + > + /* clear unconditionally, full reset won't care about it */ > + clear_bit(I915_RESET_ENGINE_IN_PROGRESS, > + &dev_priv->gpu_error.flags); > + > + if (!engine_mask) > + goto out; > + } > + > + /* full reset needs the mutex, stop any other user trying to do so */ > if (test_and_set_bit(I915_RESET_BACKOFF, > &dev_priv->gpu_error.flags)) We have a big gaping hole here in coordination between a global reset and per-engine resets. I think you want to do something like: if (intel_has_engine_reset()) { for_each_engine_mask() { if (test_and_set_bit(I915_RESET_ENGINE + engine->id)) continue; if (i915_reset_engine() == 0) engine_mask &= ~engine->mask; clear_bit(I915_RESET_ENGINE + engine->id); wake_up_bit(I915_RESET_ENGINE + engine->id); } } if (!engine_mask) goto out; if (test_and_set_bit(I915_RESET_BACKOFF)) goto out; for_each_engine() { wait_on_bit(I915_RESET_ENGINE + engine->id); set_bit(I915_RESET_ENGINE); } ...do global reset... for_each_engine() { clear_bit(I915_RESET_ENGINE); } The idea is that any per-engine reset that comes in whilst global is in progress can skip, and that the global waits for any per-engine reset to finish before clobbering state. The window in which global reset completes and a new hang is detected before we clear the bits, I am judiciously ignoring. Also this should allow us to perform parallel resets between engines. (Now that's a selling point!) -Chris
On 6/2/2017 1:16 PM, Chris Wilson wrote: > Quoting Michel Thierry (2017-05-22 18:46:24) >> + /* try engine reset first, and continue if fails */ > > /* Please use sentences when convenient. It looks much neater that way. */ > _less_ broken English: /* * Try engine reset when available. We fall back to full reset if * single reset fails. */ >> + if (intel_has_reset_engine(dev_priv)) { >> + struct intel_engine_cs *engine; >> + unsigned int tmp; >> + >> + /* protect against concurrent reset attempts */ >> + if (test_and_set_bit(I915_RESET_ENGINE_IN_PROGRESS, >> + &dev_priv->gpu_error.flags)) >> + goto out; >> + >> + for_each_engine_masked(engine, dev_priv, engine_mask, tmp) { >> + if (i915_reset_engine(engine) == 0) >> + engine_mask &= ~intel_engine_flag(engine); >> + } >> + >> + /* clear unconditionally, full reset won't care about it */ >> + clear_bit(I915_RESET_ENGINE_IN_PROGRESS, >> + &dev_priv->gpu_error.flags); >> + >> + if (!engine_mask) >> + goto out; >> + } >> + >> + /* full reset needs the mutex, stop any other user trying to do so */ >> if (test_and_set_bit(I915_RESET_BACKOFF, >> &dev_priv->gpu_error.flags)) > > We have a big gaping hole here in coordination between a global reset > and per-engine resets. > > I think you want to do something like: > > if (intel_has_engine_reset()) { > for_each_engine_mask() { > if (test_and_set_bit(I915_RESET_ENGINE + engine->id)) > continue; > > if (i915_reset_engine() == 0) > engine_mask &= ~engine->mask; > > clear_bit(I915_RESET_ENGINE + engine->id); > wake_up_bit(I915_RESET_ENGINE + engine->id); > } > } > > if (!engine_mask) > goto out; > > if (test_and_set_bit(I915_RESET_BACKOFF)) > goto out; > > for_each_engine() { > wait_on_bit(I915_RESET_ENGINE + engine->id); > set_bit(I915_RESET_ENGINE); > } > > ...do global reset... > > for_each_engine() { > clear_bit(I915_RESET_ENGINE); > } > > The idea is that any per-engine reset that comes in whilst global is in > progress can skip, and that the global waits for any per-engine reset to > finish before clobbering state. The window in which global reset > completes and a new hang is detected before we clear the bits, I am > judiciously ignoring. Also this should allow us to perform parallel > resets between engines. (Now that's a selling point!) Fair enough, I can change it to support parallel resets. One thing I forgot to ask, what should I do with the error/reset uevents? As it is, we will only tell userspace in case of global reset. Thanks
Quoting Michel Thierry (2017-06-02 21:38:56) > One thing I forgot to ask, what should I do with the error/reset > uevents? As it is, we will only tell userspace in case of global reset. My first thought is that we don't care to send the uevent for anything less than a global reset. The original use case was for abrtd style automatic bug reporters, but afaik no one is harvesting those error reports any more. I think we could drop it and no one would be the wiser. (It's not an ABI break if no one notices.) On the other hand, we do use the uevent to catch bugs (unexpected hangs) in igt. For igt, I could just disable per-engine resets for the duration of the those tests. Those tests are not about testing the reset and so missing coverage of per-engine reset is not an issue. Since this is an i915-specific feature, I think the actual listeners to this uevent are going to be rare, tending to 0. I hope. -Chris
Quoting Chris Wilson (2017-06-02 21:16:42) > I think you want to do something like: > > if (intel_has_engine_reset()) { > for_each_engine_mask() { > if (test_and_set_bit(I915_RESET_ENGINE + engine->id)) > continue; > > if (i915_reset_engine() == 0) > engine_mask &= ~engine->mask; > > clear_bit(I915_RESET_ENGINE + engine->id); > wake_up_bit(I915_RESET_ENGINE + engine->id); > } > } > > if (!engine_mask) > goto out; > > if (test_and_set_bit(I915_RESET_BACKOFF)) > goto out; > > for_each_engine() { > wait_on_bit(I915_RESET_ENGINE + engine->id); > set_bit(I915_RESET_ENGINE); > } This needs to be in a loop for that sweet layer of paranoia. while (test_and_set_bit(I915_RESET_ENGINE + engine->id)) wait_on_bit(I915_RESET_ENGINE + engine->id); -Chris
And whilst I'm here, we need to extend I915_PARAM_HAS_GPU_RESET to indicate having per-engine resets for the complimentary set of igt. -Chris
On 6/4/2017 5:06 AM, Chris Wilson wrote: > And whilst I'm here, we need to extend I915_PARAM_HAS_GPU_RESET to > indicate having per-engine resets for the complimentary set of igt. > -Chris > Something like this? case I915_PARAM_HAS_GPU_RESET: - value = i915.enable_hangcheck && intel_has_gpu_reset(dev_priv); + value = i915.enable_hangcheck; + if (value) + value = intel_has_reset_engine(dev_priv) ? 2 : + intel_has_gpu_reset(dev_priv) ? 1 : 0; break; (you'll probably think of a nicer way to do it) I'm also sending the patch with the changes to support parallel resets, some other patches need rebase after this change, but I'm holding on those. Thanks,
Quoting Michel Thierry (2017-06-06 01:40:31) > On 6/4/2017 5:06 AM, Chris Wilson wrote: > > And whilst I'm here, we need to extend I915_PARAM_HAS_GPU_RESET to > > indicate having per-engine resets for the complimentary set of igt. > > -Chris > > > > Something like this? > > case I915_PARAM_HAS_GPU_RESET: > - value = i915.enable_hangcheck && > intel_has_gpu_reset(dev_priv); > + value = i915.enable_hangcheck; > + if (value) > + value = intel_has_reset_engine(dev_priv) ? 2 : > + intel_has_gpu_reset(dev_priv) ? 1 : 0; > break; > > (you'll probably think of a nicer way to do it) I didn't think it was sensible to advertise reset-engine support without global reset (or the hangcheck to detect the error), and for the time being we can keep thinking of this as an integer rather than a set of flags. So I was just thinking of value = i915.enable_hangcheck && intel_has_gpu_reset(dev_priv); if (value && intel_has_reset_engine(dev_priv)) value = 2; If you want to propose breaking it into flags value = !!i915.enable_hangcheck; /* can't remember if this is bool */ if (intel_has_gpu_reset(dev_priv)) value |= BIT(1); if (intel_has_reset_engine(dev_priv)) value |= BIT(2); Then we need to teach igt to look at the flags. -Chris
On 6/6/2017 3:16 AM, Chris Wilson wrote: > I didn't think it was sensible to advertise reset-engine support without > global reset (or the hangcheck to detect the error), and for the time > being we can keep thinking of this as an integer rather than a set of > flags. > > So I was just thinking of > > value = i915.enable_hangcheck && intel_has_gpu_reset(dev_priv); > if (value && intel_has_reset_engine(dev_priv)) > value = 2; reset-engine will depend on global reset (it uses gen8_reset_engines with the right engine_mask), so yes, lets use this. Thanks
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c index d703897786e9..f49f9d9273f2 100644 --- a/drivers/gpu/drm/i915/i915_drv.c +++ b/drivers/gpu/drm/i915/i915_drv.c @@ -1873,6 +1873,19 @@ void i915_reset(struct drm_i915_private *dev_priv) goto finish; } +/** + * i915_reset_engine - reset GPU engine to recover from a hang + * @engine: engine to reset + * + * Reset a specific GPU engine. Useful if a hang is detected. + * Returns zero on successful reset or otherwise an error code. + */ +int i915_reset_engine(struct intel_engine_cs *engine) +{ + /* FIXME: replace me with engine reset sequence */ + return -ENODEV; +} + static int i915_pm_suspend(struct device *kdev) { struct pci_dev *pdev = to_pci_dev(kdev); diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index f2a01bc04168..6848ed75611d 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -753,6 +753,7 @@ struct intel_csr { func(has_ddi); \ func(has_decoupled_mmio); \ func(has_dp_mst); \ + func(has_reset_engine); \ func(has_fbc); \ func(has_fpga_dbg); \ func(has_full_ppgtt); \ @@ -1546,6 +1547,10 @@ struct i915_gpu_error { * inspect the bit and do the reset directly, otherwise the worker * waits for the struct_mutex. * + * #I915_RESET_ENGINE_IN_PROGRESS - Since the driver doesn't need to + * acquire the struct_mutex to reset an engine, we need an explicit + * flag to prevent two concurrent reset-engine attempts. + * * #I915_WEDGED - If reset fails and we can no longer use the GPU, * we set the #I915_WEDGED bit. Prior to command submission, e.g. * i915_gem_request_alloc(), this bit is checked and the sequence @@ -1554,6 +1559,7 @@ struct i915_gpu_error { unsigned long flags; #define I915_RESET_BACKOFF 0 #define I915_RESET_HANDOFF 1 +#define I915_RESET_ENGINE_IN_PROGRESS 2 #define I915_WEDGED (BITS_PER_LONG - 1) /** @@ -3038,6 +3044,8 @@ extern void i915_driver_unload(struct drm_device *dev); extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask); extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv); extern void i915_reset(struct drm_i915_private *dev_priv); +extern int i915_reset_engine(struct intel_engine_cs *engine); +extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv); extern int intel_guc_reset(struct drm_i915_private *dev_priv); extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine); extern void intel_hangcheck_init(struct drm_i915_private *dev_priv); diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c index 7b7f55a28eec..da4f49447fb0 100644 --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -2743,6 +2743,30 @@ void i915_handle_error(struct drm_i915_private *dev_priv, if (!engine_mask) goto out; + /* try engine reset first, and continue if fails */ + if (intel_has_reset_engine(dev_priv)) { + struct intel_engine_cs *engine; + unsigned int tmp; + + /* protect against concurrent reset attempts */ + if (test_and_set_bit(I915_RESET_ENGINE_IN_PROGRESS, + &dev_priv->gpu_error.flags)) + goto out; + + for_each_engine_masked(engine, dev_priv, engine_mask, tmp) { + if (i915_reset_engine(engine) == 0) + engine_mask &= ~intel_engine_flag(engine); + } + + /* clear unconditionally, full reset won't care about it */ + clear_bit(I915_RESET_ENGINE_IN_PROGRESS, + &dev_priv->gpu_error.flags); + + if (!engine_mask) + goto out; + } + + /* full reset needs the mutex, stop any other user trying to do so */ if (test_and_set_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags)) goto out; diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c index f80db2ccd92f..4dfb400aef85 100644 --- a/drivers/gpu/drm/i915/i915_pci.c +++ b/drivers/gpu/drm/i915/i915_pci.c @@ -310,7 +310,8 @@ static const struct intel_device_info intel_haswell_info = { BDW_COLORS, \ .has_logical_ring_contexts = 1, \ .has_full_48bit_ppgtt = 1, \ - .has_64bit_reloc = 1 + .has_64bit_reloc = 1, \ + .has_reset_engine = 1 static const struct intel_device_info intel_broadwell_info = { BDW_FEATURES, @@ -341,6 +342,7 @@ static const struct intel_device_info intel_cherryview_info = { .has_gmch_display = 1, .has_aliasing_ppgtt = 1, .has_full_ppgtt = 1, + .has_reset_engine = 1, .display_mmio_offset = VLV_DISPLAY_BASE, GEN_CHV_PIPEOFFSETS, CURSOR_OFFSETS, @@ -389,6 +391,7 @@ static const struct intel_device_info intel_skylake_gt3_info = { .has_aliasing_ppgtt = 1, \ .has_full_ppgtt = 1, \ .has_full_48bit_ppgtt = 1, \ + .has_reset_engine = 1, \ GEN_DEFAULT_PIPEOFFSETS, \ IVB_CURSOR_OFFSETS, \ BDW_COLORS diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c index 9269cae5713a..fdfd8c66c956 100644 --- a/drivers/gpu/drm/i915/intel_uncore.c +++ b/drivers/gpu/drm/i915/intel_uncore.c @@ -1836,6 +1836,17 @@ bool intel_has_gpu_reset(struct drm_i915_private *dev_priv) return intel_get_gpu_reset(dev_priv) != NULL; } +/* + * When GuC submission is enabled, GuC manages ELSP and can initiate the + * engine reset too. For now, fall back to full GPU reset if it is enabled. + */ +bool intel_has_reset_engine(struct drm_i915_private *dev_priv) +{ + return (dev_priv->info.has_reset_engine && + !dev_priv->guc.execbuf_client && + i915.reset >= 2); +} + int intel_guc_reset(struct drm_i915_private *dev_priv) { int ret;