From patchwork Thu Jan 12 04:18:11 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Michel Thierry X-Patchwork-Id: 9512113 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 16C75601E7 for ; Thu, 12 Jan 2017 04:18:32 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 09C8F2862B for ; Thu, 12 Jan 2017 04:18:32 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id F2D852863B; Thu, 12 Jan 2017 04:18:31 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-4.2 required=2.0 tests=BAYES_00, RCVD_IN_DNSWL_MED autolearn=ham version=3.3.1 Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 2BAC82862B for ; Thu, 12 Jan 2017 04:18:31 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 166D46EB98; Thu, 12 Jan 2017 04:18:27 +0000 (UTC) X-Original-To: intel-gfx@lists.freedesktop.org Delivered-To: intel-gfx@lists.freedesktop.org Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by gabe.freedesktop.org (Postfix) with ESMTPS id 00CC36EB0E for ; Thu, 12 Jan 2017 04:18:18 +0000 (UTC) Received: from orsmga002.jf.intel.com ([10.7.209.21]) by fmsmga104.fm.intel.com with ESMTP; 11 Jan 2017 20:18:18 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.33,348,1477983600"; d="scan'208";a="29330432" Received: from relo-linux-11.sc.intel.com ([10.3.160.214]) by orsmga002.jf.intel.com with ESMTP; 11 Jan 2017 20:18:18 -0800 From: Michel Thierry To: intel-gfx@lists.freedesktop.org Date: Wed, 11 Jan 2017 20:18:11 -0800 Message-Id: <20170112041817.1102-5-michel.thierry@intel.com> X-Mailer: git-send-email 2.11.0 In-Reply-To: <20170112041817.1102-1-michel.thierry@intel.com> References: <20170112041817.1102-1-michel.thierry@intel.com> Subject: [Intel-gfx] [PATCH 04/10] drm/i915/tdr: Modify error handler for per engine hang recovery X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" X-Virus-Scanned: ClamAV using ClamSMTP From: Arun Siluvery This is a preparatory patch which modifies error handler to do per engine hang recovery. The actual patch which implements this sequence follows later in the series. The aim is to prepare existing recovery function to adapt to this new function where applicable (which fails at this point because core implementation is lacking) and continue recovery using legacy full gpu reset. A helper function is also added to query the availability of engine reset. The error events behaviour that are used to notify user of reset are adapted to engine reset such that it doesn't break users listening to these events. In legacy we report an error event, a reset event before resetting the gpu and a reset done event marking the completion of reset. The same behaviour is adapted but reset event is only dispatched once even when multiple engines are hung. Finally once reset is complete we send reset done event as usual. Note that this implementation of engine reset is for i915 directly submitting to the ELSP, where the driver manages the hang detection, recovery and resubmission. With GuC submission these tasks are shared between driver and firmware; i915 will still responsible for detecting a hang, and when it does it will have to request GuC to reset that Engine and remind the firmware about the outstanding submissions. v2: rebase, advertise engine reset availability in platform definition, add note about GuC submission. v3: s/*engine_reset*/*reset_engine*/. (Chris) Handle reset as 2 level resets, by first going to engine only and fall backing to full/chip reset as needed, i.e. reset_engine will need the struct_mutex. Cc: Chris Wilson Cc: Mika Kuoppala Signed-off-by: Ian Lister Signed-off-by: Tomas Elf Signed-off-by: Arun Siluvery Signed-off-by: Michel Thierry --- drivers/gpu/drm/i915/i915_drv.c | 50 +++++++++++++++++++++++++++++++++++-- drivers/gpu/drm/i915/i915_drv.h | 5 ++++ drivers/gpu/drm/i915/i915_irq.c | 31 +++++++++++++++++------ drivers/gpu/drm/i915/i915_pci.c | 5 +++- drivers/gpu/drm/i915/intel_uncore.c | 11 ++++++++ 5 files changed, 92 insertions(+), 10 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c index 4e5ea5898e06..66b620a7f6ba 100644 --- a/drivers/gpu/drm/i915/i915_drv.c +++ b/drivers/gpu/drm/i915/i915_drv.c @@ -1745,7 +1745,7 @@ static void enable_engines_irq(struct drm_i915_private *dev_priv) } /** - * i915_reset - reset chip after a hang + * i915_reset_chip - reset chip after a hang * @dev_priv: device private to reset * * Reset the chip. Useful if a hang is detected. Marks the device as wedged @@ -1761,7 +1761,7 @@ static void enable_engines_irq(struct drm_i915_private *dev_priv) * - re-init interrupt state * - re-init display */ -void i915_reset(struct drm_i915_private *dev_priv) +void i915_reset_chip(struct drm_i915_private *dev_priv) { struct i915_gpu_error *error = &dev_priv->gpu_error; int ret; @@ -1771,6 +1771,8 @@ void i915_reset(struct drm_i915_private *dev_priv) if (!test_and_clear_bit(I915_RESET_IN_PROGRESS, &error->flags)) return; + DRM_DEBUG_DRIVER("resetting chip\n"); + /* Clear any previous failed attempts at recovery. Time to try again. */ __clear_bit(I915_WEDGED, &error->flags); error->reset_count++; @@ -1824,6 +1826,50 @@ void i915_reset(struct drm_i915_private *dev_priv) goto wakeup; } +/** + * i915_reset_engine - reset GPU engine to recover from a hang + * @engine: engine to reset + * + * Reset a specific GPU engine. Useful if a hang is detected. + * Returns zero on successful reset or otherwise an error code. + */ +int i915_reset_engine(struct intel_engine_cs *engine) +{ + /* FIXME: replace me with engine reset sequence */ + return -ENODEV; +} + +/** + * i915_reset - start either engine or full GPU reset to recover from a hang + * @dev_priv: device private + * + * Wrapper function to initiate a GPU reset. If platform supports it, attempt + * to reset the hung engine(s) only. In engine reset fails (or not supported), + * reset the full GPU. + * + * Caller must hold the struct_mutex. + */ +void i915_reset(struct drm_i915_private *dev_priv) +{ + /* If hardware supports it (GEN8+), try engine reset first */ + if (intel_has_reset_engine(dev_priv)) { + struct intel_engine_cs *engine; + u32 engine_mask = dev_priv->gpu_error.reset_engine_mask; + unsigned int tmp; + + for_each_engine_masked(engine, dev_priv, engine_mask, tmp) { + /* on failure fallback to full gpu reset for recovery */ + if (i915_reset_engine(engine)) + goto error; + } + + return; + } + +error: + i915_reset_chip(dev_priv); +} + static int i915_pm_suspend(struct device *kdev) { struct pci_dev *pdev = to_pci_dev(kdev); diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index 206b4e1a3a58..873fbd78124f 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -784,6 +784,7 @@ struct intel_csr { func(has_ddi); \ func(has_decoupled_mmio); \ func(has_dp_mst); \ + func(has_reset_engine); \ func(has_fbc); \ func(has_fpga_dbg); \ func(has_full_ppgtt); \ @@ -1552,6 +1553,9 @@ struct i915_gpu_error { #define I915_RESET_IN_PROGRESS 0 #define I915_WEDGED (BITS_PER_LONG - 1) + /* if available, engine-specific reset is tried before full gpu reset */ + u32 reset_engine_mask; + /** * Waitqueue to signal when a hang is detected. Used to for waiters * to release the struct_mutex for the reset to procede. @@ -2933,6 +2937,7 @@ extern void i915_driver_unload(struct drm_device *dev); extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask); extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv); extern void i915_reset(struct drm_i915_private *dev_priv); +extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv); extern int intel_guc_reset(struct drm_i915_private *dev_priv); extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine); extern void intel_hangcheck_init(struct drm_i915_private *dev_priv); diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c index ce5663d94839..393a118c964e 100644 --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -2630,7 +2630,15 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv) kobject_uevent_env(kobj, KOBJ_CHANGE, error_event); - DRM_DEBUG_DRIVER("resetting chip\n"); + /* + * This event needs to be sent before performing gpu reset. When + * engine resets are supported we iterate through all engines and + * reset hung engines individually. To keep the event dispatch + * mechanism consistent with full gpu reset, this is only sent once + * even when multiple engines are hung. It is also safe to move this + * here because when we are in this function, we will definitely + * perform gpu reset. + */ kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event); /* @@ -2645,17 +2653,20 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv) do { /* - * All state reset _must_ be completed before we update the - * reset counter, for otherwise waiters might miss the reset - * pending state and not properly drop locks, resulting in - * deadlocks with the reset work. + * All state reset _must_ be completed before we update + * the reset counter, for otherwise waiters might miss + * the reset pending state and not properly drop locks, + * resulting in deadlocks with the reset work. */ if (mutex_trylock(&dev_priv->drm.struct_mutex)) { i915_reset(dev_priv); mutex_unlock(&dev_priv->drm.struct_mutex); } - /* We need to wait for anyone holding the lock to wakeup */ + /* + * We need to wait for anyone holding the lock to + * wakeup. + */ } while (wait_on_bit_timeout(&dev_priv->gpu_error.flags, I915_RESET_IN_PROGRESS, TASK_UNINTERRUPTIBLE, @@ -2664,7 +2675,7 @@ static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv) intel_finish_reset(dev_priv); intel_runtime_pm_put(dev_priv); - if (!test_bit(I915_WEDGED, &dev_priv->gpu_error.flags)) + if (!i915_terminally_wedged(&dev_priv->gpu_error)) kobject_uevent_env(kobj, KOBJ_CHANGE, reset_done_event); @@ -2760,6 +2771,12 @@ void i915_handle_error(struct drm_i915_private *dev_priv, return; /* + * Save which engines need reset; if engine support is available, + * we can just reset the hung engines. + */ + dev_priv->gpu_error.reset_engine_mask = engine_mask; + + /* * Wakeup waiting processes so that the reset function * i915_reset_and_wakeup doesn't deadlock trying to grab * various locks. By bumping the reset counter first, the woken diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c index eb7b75a16d98..d0b74250f607 100644 --- a/drivers/gpu/drm/i915/i915_pci.c +++ b/drivers/gpu/drm/i915/i915_pci.c @@ -307,7 +307,8 @@ static const struct intel_device_info intel_haswell_info = { BDW_COLORS, \ .has_logical_ring_contexts = 1, \ .has_full_48bit_ppgtt = 1, \ - .has_64bit_reloc = 1 + .has_64bit_reloc = 1, \ + .has_reset_engine = 1 static const struct intel_device_info intel_broadwell_info = { BDW_FEATURES, @@ -339,6 +340,7 @@ static const struct intel_device_info intel_cherryview_info = { .has_gmch_display = 1, .has_aliasing_ppgtt = 1, .has_full_ppgtt = 1, + .has_reset_engine = 1, .display_mmio_offset = VLV_DISPLAY_BASE, GEN_CHV_PIPEOFFSETS, CURSOR_OFFSETS, @@ -390,6 +392,7 @@ static const struct intel_device_info intel_skylake_gt3_info = { .has_aliasing_ppgtt = 1, \ .has_full_ppgtt = 1, \ .has_full_48bit_ppgtt = 1, \ + .has_reset_engine = 1, \ GEN_DEFAULT_PIPEOFFSETS, \ IVB_CURSOR_OFFSETS, \ BDW_COLORS diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c index abe08885a5ba..cf7355db5486 100644 --- a/drivers/gpu/drm/i915/intel_uncore.c +++ b/drivers/gpu/drm/i915/intel_uncore.c @@ -1851,6 +1851,17 @@ bool intel_has_gpu_reset(struct drm_i915_private *dev_priv) return intel_get_gpu_reset(dev_priv) != NULL; } +/* + * When GuC submission is enabled, GuC manages ELSP and can initiate the + * engine reset too. For now, fall back to full GPU reset if it is enabled. + */ +bool intel_has_reset_engine(struct drm_i915_private *dev_priv) +{ + return (dev_priv->info.has_reset_engine && + !dev_priv->guc.execbuf_client && + i915.reset == 2); +} + int intel_guc_reset(struct drm_i915_private *dev_priv) { int ret;