From patchwork Fri Feb 18 21:33:05 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 12751947 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id D7F48C433F5 for ; Fri, 18 Feb 2022 21:33:17 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 2042B10EA5E; Fri, 18 Feb 2022 21:33:12 +0000 (UTC) Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by gabe.freedesktop.org (Postfix) with ESMTPS id 8B50910EA53; Fri, 18 Feb 2022 21:33:08 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1645219988; x=1676755988; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=NoZlZ9ZEe+pAsQl/jbPO0eGAzLR3mb2aSlY4gjgTjtw=; b=hHOqvFWb2MvhK0FI6Ix54DsQZRb8lmdm8herqvbl+nmJBBKkHghchf4f iB/GvFNEKtLWnH+rMERl10OOJ2ivc/OKwyoBxgvTiUV2JSeLVtThB4Me8 ktqYLUIr+CzB3KyLx7QR0sM/EObrF2LZ0U866xEfL4VdC73G/0HR4+p3L ZJn/ne1kgXVvEfUrvkB9qYrQ257qizpncwnQxS0T/96MyAkFaOiN5hnFL p8bP4cqzHiU9Jg8Xxw2xzvGao/L8I/ckGkpKVhTvO0ChD0xPBhxJzXMhU Gpyq9zlmMYk/V4OJDbk9VJk08LGGMj9993h+1eXJ5z/uuU1yxt5nmkS+X w==; X-IronPort-AV: E=McAfee;i="6200,9189,10262"; a="238638709" X-IronPort-AV: E=Sophos;i="5.88,379,1635231600"; d="scan'208";a="238638709" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Feb 2022 13:33:07 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.88,379,1635231600"; d="scan'208";a="546499005" Received: from relo-linux-5.jf.intel.com ([10.165.21.134]) by 
orsmga008.jf.intel.com with ESMTP; 18 Feb 2022 13:33:07 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Date: Fri, 18 Feb 2022 13:33:05 -0800 Message-Id: <20220218213307.1338478-2-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220218213307.1338478-1-John.C.Harrison@Intel.com> References: <20220218213307.1338478-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ Subject: [Intel-gfx] [PATCH 1/3] drm/i915/guc: Limit scheduling properties to avoid overflow X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: DRI-Devel@Lists.FreeDesktop.Org Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx"

From: John Harrison

GuC converts the pre-emption timeout and timeslice quantum values into
clock ticks internally. That significantly lowers the point at which a
32-bit overflow occurs. On current platforms, the worst case is
approximately 110 seconds. Rather than allowing the user to set higher
values and then be confused by early timeouts, add limits when setting
these values.
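
As a rough illustration of the arithmetic behind the limit, the sketch
below computes the largest millisecond value whose tick count still fits
in an unsigned 32-bit counter. The helper names and the 38.4 MHz clock
used in the note afterwards are assumptions made for this example only;
the patch states a worst case of about 110 seconds and does not give a
clock frequency.

```c
#include <stdint.h>

/*
 * Largest timeout in milliseconds whose tick count still fits in 32 bits
 * for a counter running at clock_hz ticks per second.
 * Hypothetical helper for illustration; not part of i915 or the GuC ABI.
 */
static uint64_t max_timeout_ms(uint64_t clock_hz)
{
	/* ticks = ms * clock_hz / 1000 must not exceed UINT32_MAX */
	return ((uint64_t)UINT32_MAX * 1000) / clock_hz;
}

/*
 * Clamp a requested timeout to a fixed policy limit, mirroring the
 * "clamp rather than reject" behaviour described in the commit message.
 * Hypothetical helper for illustration.
 */
static uint64_t clamp_timeout_ms(uint64_t requested_ms, uint64_t limit_ms)
{
	return requested_ms > limit_ms ? limit_ms : requested_ms;
}
```

With an assumed 38.4 MHz clock, max_timeout_ms(38400000) evaluates to
111848, i.e. just under 112 seconds, which is consistent with the figure
quoted above. A 100 second cap leaves some headroom below that point.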
Signed-off-by: John Harrison
Reviewed-by: Daniele Ceraolo Spurio
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c   | 15 +++++++++++++++
 drivers/gpu/drm/i915/gt/sysfs_engines.c     | 14 ++++++++++++++
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h |  9 +++++++++
 3 files changed, 38 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index e53008b4dd05..2a1e9f36e6f5 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -389,6 +389,21 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
 	if (GRAPHICS_VER(i915) == 12 && engine->class == RENDER_CLASS)
 		engine->props.preempt_timeout_ms = 0;
 
+	/* Cap timeouts to prevent overflow inside GuC */
+	if (intel_guc_submission_is_wanted(&gt->uc.guc)) {
+		if (engine->props.timeslice_duration_ms > GUC_POLICY_MAX_EXEC_QUANTUM_MS) {
+			drm_info(&engine->i915->drm, "Warning, clamping timeslice duration to %d to prevent possibly overflow\n",
+				 GUC_POLICY_MAX_EXEC_QUANTUM_MS);
+			engine->props.timeslice_duration_ms = GUC_POLICY_MAX_EXEC_QUANTUM_MS;
+		}
+
+		if (engine->props.preempt_timeout_ms > GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS) {
+			drm_info(&engine->i915->drm, "Warning, clamping pre-emption timeout to %d to prevent possibly overflow\n",
+				 GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS);
+			engine->props.preempt_timeout_ms = GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS;
+		}
+	}
+
 	engine->defaults = engine->props; /* never to change again */
 
 	engine->context_size = intel_engine_context_size(gt, engine->class);
diff --git a/drivers/gpu/drm/i915/gt/sysfs_engines.c b/drivers/gpu/drm/i915/gt/sysfs_engines.c
index 967031056202..f57efe026474 100644
--- a/drivers/gpu/drm/i915/gt/sysfs_engines.c
+++ b/drivers/gpu/drm/i915/gt/sysfs_engines.c
@@ -221,6 +221,13 @@ timeslice_store(struct kobject *kobj, struct kobj_attribute *attr,
 	if (duration > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
 		return -EINVAL;
 
+	if (intel_uc_uses_guc_submission(&engine->gt->uc) &&
+	    duration > GUC_POLICY_MAX_EXEC_QUANTUM_MS) {
+		duration = GUC_POLICY_MAX_EXEC_QUANTUM_MS;
+		drm_info(&engine->i915->drm, "Warning, clamping timeslice duration to %lld to prevent possibly overflow\n",
+			 duration);
+	}
+
 	WRITE_ONCE(engine->props.timeslice_duration_ms, duration);
 
 	if (execlists_active(&engine->execlists))
@@ -325,6 +332,13 @@ preempt_timeout_store(struct kobject *kobj, struct kobj_attribute *attr,
 	if (timeout > jiffies_to_msecs(MAX_SCHEDULE_TIMEOUT))
 		return -EINVAL;
 
+	if (intel_uc_uses_guc_submission(&engine->gt->uc) &&
+	    timeout > GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS) {
+		timeout = GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS;
+		drm_info(&engine->i915->drm, "Warning, clamping pre-emption timeout to %lld to prevent possibly overflow\n",
+			 timeout);
+	}
+
 	WRITE_ONCE(engine->props.preempt_timeout_ms, timeout);
 
 	if (READ_ONCE(engine->execlists.pending[0]))
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
index 6a4612a852e2..ad131092f8df 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h
@@ -248,6 +248,15 @@ struct guc_lrc_desc {
 
 #define GLOBAL_POLICY_DEFAULT_DPC_PROMOTE_TIME_US 500000
 
+/*
+ * GuC converts the timeout to clock ticks internally. Different platforms have
+ * different GuC clocks. Thus, the maximum value before overflow is platform
+ * dependent. Current worst case scenario is about 110s. So, limit to 100s to be
+ * safe.
+ */
+#define GUC_POLICY_MAX_EXEC_QUANTUM_MS (100 * 1000)
+#define GUC_POLICY_MAX_PREEMPT_TIMEOUT_MS (100 * 1000)
+
 struct guc_policies {
 	u32 submission_queue_depth[GUC_MAX_ENGINE_CLASSES];
 	/* In micro seconds. How much time to allow before DPC processing is

From patchwork Fri Feb 18 21:33:06 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 12751948 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 9775FC433FE for ; Fri, 18 Feb 2022 21:33:18 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 018A610EA5B; Fri, 18 Feb 2022 21:33:13 +0000 (UTC) Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by gabe.freedesktop.org (Postfix) with ESMTPS id C63D110EA54; Fri, 18 Feb 2022 21:33:08 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1645219988; x=1676755988; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=5kp1h7MtBBQbK32PJp/Ml6oXSWTgMBwujtdh24JjeiE=; b=jD1f4wvOdEbJwmb4n/8xufbFXcjDpAI3asId4ereSuzsN2N6Vsu9kcYu vxlQB0kbD2hqUIrS6TolNFd+S2DRhVULILzWGTtBdhxN/CCj6F9AcpyvO EAA4NqjARNFWQiZ3oTYo2l00KEhTWzABcQqfFzCn5jR943l3r6Bc4pHsg j8NXdx+lcATJpmRRiPYZ/8tyLX18ntgGtnD3PmCS/NMqNKChgME1YnvXb WuRFIXjM8d4Lv6guZ4moHMwgbCb8nchtp4dnoNO+nZHfm5h31u6/eBkSa PgHegmxQ1s3eM3Qh1tFwxSbs/P+4BiswA8RptYeAJx8eacSb0vbgLYi1N A==; X-IronPort-AV: E=McAfee;i="6200,9189,10262"; a="238638711" X-IronPort-AV: E=Sophos;i="5.88,379,1635231600"; d="scan'208";a="238638711" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Feb 2022 13:33:07 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.88,379,1635231600"; d="scan'208";a="546499008" Received: from 
relo-linux-5.jf.intel.com ([10.165.21.134]) by orsmga008.jf.intel.com with ESMTP; 18 Feb 2022 13:33:07 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Date: Fri, 18 Feb 2022 13:33:06 -0800 Message-Id: <20220218213307.1338478-3-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220218213307.1338478-1-John.C.Harrison@Intel.com> References: <20220218213307.1338478-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ Subject: [Intel-gfx] [PATCH 2/3] drm/i915/gt: Make the heartbeat play nice with long pre-emption timeouts X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: DRI-Devel@Lists.FreeDesktop.Org Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx"

From: John Harrison

Compute workloads are inherently not pre-emptible for long periods on
current hardware. As a workaround for this, the pre-emption timeout
for compute capable engines was disabled. This is undesirable with GuC
submission as it prevents per engine reset of hung contexts. Hence the
next patch will re-enable the timeout but bumped up by an order of
magnitude.

However, the heartbeat might not respect that. Depending upon current
activity, a pre-emption to the heartbeat pulse might not even be
attempted until the last heartbeat period, which means that only one
period is granted for the pre-emption to occur. With the aforesaid
bump, the pre-emption timeout could be significantly larger than this
heartbeat period.

So adjust the heartbeat code to take the pre-emption timeout into
account. When it reaches the final (high priority) period, it now
ensures the delay before hitting reset is bigger than the pre-emption
timeout.
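
The delay selection described above can be modelled in isolation as
follows. This is a simplified sketch, not the driver code:
heartbeat_delay_ms() and its final_period flag are hypothetical
stand-ins for the check on the pending heartbeat request's priority,
and the factor of two is taken from the diff in this patch.

```c
#include <stdbool.h>

/*
 * Pick the delay before the next heartbeat check.
 * Hypothetical model for illustration; final_period stands in for
 * "the pending heartbeat request is already at barrier priority".
 */
static long heartbeat_delay_ms(long heartbeat_interval_ms,
			       long preempt_timeout_ms,
			       bool final_period)
{
	long delay = heartbeat_interval_ms;

	if (final_period) {
		/* Allow twice the pre-emption timeout before a reset. */
		long longer = preempt_timeout_ms * 2;

		if (longer > delay)
			delay = longer;
	}

	return delay;
}
```

For instance, with a 2500 ms interval and a 7500 ms pre-emption timeout
(example values only), the final period is stretched to 15000 ms, while
earlier periods keep the 2500 ms interval.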
Signed-off-by: John Harrison
---
 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index a3698f611f45..72a82a6085e0 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -22,9 +22,25 @@
 
 static bool next_heartbeat(struct intel_engine_cs *engine)
 {
+	struct i915_request *rq;
 	long delay;
 
 	delay = READ_ONCE(engine->props.heartbeat_interval_ms);
+
+	rq = engine->heartbeat.systole;
+	if (rq && rq->sched.attr.priority >= I915_PRIORITY_BARRIER) {
+		long longer;
+
+		/*
+		 * The final try is at the highest priority possible. Up until now
+		 * a pre-emption might not even have been attempted. So make sure
+		 * this last attempt allows enough time for a pre-emption to occur.
+		 */
+		longer = READ_ONCE(engine->props.preempt_timeout_ms) * 2;
+		if (longer > delay)
+			delay = longer;
+	}
+
 	if (!delay)
 		return false;
 

From patchwork Fri Feb 18 21:33:07 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 12751946 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 13D5EC433EF for ; Fri, 18 Feb 2022 21:33:17 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id F15D910EA55; Fri, 18 Feb 2022 21:33:11 +0000 (UTC) Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by gabe.freedesktop.org (Postfix) with ESMTPS id 05EF110EA3C; Fri, 18 Feb 2022 21:33:08 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; 
c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1645219989; x=1676755989; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=ncVfe66uzdNKMKj7i/OYYYxUGUlYy74ul0X4fkC9lMo=; b=nidMRqwWhBKKx0i6o18cg6orGz7TAFAEuo+o75IMoavqouK06ProMyA9 QVredHN668r/dBPcWaprPyNfjDGoGb023NgNcfzPIoEzQp/RmJbGx0lcn ZgEFqwA2PN4Gl1RTGOebUmilagGledI2+uwLfba2gcP3hSinx30SrJXui PiPXbOpz5C2Wb1n6e8sRWta7Xo68XhBzKYfFHykUVKxpTcfh6Jl+yyrT2 ZUEwF8sI113eSizYynFREk+EOberAjbCXS3gtw4xW9hqsv3bg9nojB0Qt t7DZtp47Kw0A/zSesvYc58XUd9QRyxqP1NZbupMxXW5iBt4zLXfi+R6U8 Q==; X-IronPort-AV: E=McAfee;i="6200,9189,10262"; a="238638713" X-IronPort-AV: E=Sophos;i="5.88,379,1635231600"; d="scan'208";a="238638713" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Feb 2022 13:33:07 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.88,379,1635231600"; d="scan'208";a="546499011" Received: from relo-linux-5.jf.intel.com ([10.165.21.134]) by orsmga008.jf.intel.com with ESMTP; 18 Feb 2022 13:33:07 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Date: Fri, 18 Feb 2022 13:33:07 -0800 Message-Id: <20220218213307.1338478-4-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220218213307.1338478-1-John.C.Harrison@Intel.com> References: <20220218213307.1338478-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. 
#1134945 - Pipers Way, Swindon SN3 1RJ Subject: [Intel-gfx] [PATCH 3/3] drm/i915: Improve long running OCL w/a for GuC submission X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Michal Mrozek , DRI-Devel@Lists.FreeDesktop.Org Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx"

From: John Harrison

A workaround was added to the driver to allow OpenCL workloads to run
'forever' by disabling pre-emption on the RCS engine for Gen12. It is
not totally unbounded as the heartbeat will kick in eventually and
cause a reset of the hung engine.

However, this does not work well in GuC submission mode. In GuC mode,
the pre-emption timeout is how GuC detects hung contexts and triggers
a per engine reset. Thus, disabling the timeout means also losing all
per engine reset ability. A full GT reset will still occur when the
heartbeat finally expires, but that is a much more destructive and
undesirable mechanism.

The purpose of the workaround is actually to give OpenCL tasks longer
to reach a pre-emption point after a pre-emption request has been
issued. This is necessary because Gen12 does not support mid-thread
pre-emption and OpenCL can have long running threads. So, rather than
disabling the timeout completely, just set it to a 'long' value.
CC: Michal Mrozek
Signed-off-by: John Harrison
Reviewed-by: Daniele Ceraolo Spurio
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 2a1e9f36e6f5..64249301a227 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -385,9 +385,25 @@ static int intel_engine_setup(struct intel_gt *gt, enum intel_engine_id id,
 	engine->props.timeslice_duration_ms =
 		CONFIG_DRM_I915_TIMESLICE_DURATION;
 
-	/* Override to uninterruptible for OpenCL workloads. */
-	if (GRAPHICS_VER(i915) == 12 && engine->class == RENDER_CLASS)
-		engine->props.preempt_timeout_ms = 0;
+	/*
+	 * Mid-thread pre-emption is not available in Gen12. Unfortunately,
+	 * some OpenCL workloads run quite long threads. That means they get
+	 * reset due to not pre-empting in a timely manner. So, bump the
+	 * pre-emption timeout value to be much higher for compute engines.
+	 * Using three times the heartbeat period seems long enough for a
+	 * reasonable task to reach a pre-emption point but not so long as to
+	 * allow genuine hangs to go unresolved.
+	 */
+	if (GRAPHICS_VER(i915) == 12 && engine->class == RENDER_CLASS) {
+		unsigned long triple_beat = engine->props.heartbeat_interval_ms * 3;
+
+		if (triple_beat > engine->props.preempt_timeout_ms) {
+			drm_info(&gt->i915->drm, "Bumping pre-emption timeout from %ld to %ld on %s to allow slow compute pre-emption\n",
+				 engine->props.preempt_timeout_ms, triple_beat, engine->name);
+
+			engine->props.preempt_timeout_ms = triple_beat;
+		}
+	}
 
 	/* Cap timeouts to prevent overflow inside GuC */
 	if (intel_guc_submission_is_wanted(&gt->uc.guc)) {
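
The bump applied by this patch can be sketched as a standalone function.
The name bumped_preempt_timeout_ms() is hypothetical and the numbers in
the note afterwards are example values only, not taken from the patch.

```c
/*
 * Raise the pre-emption timeout to three heartbeat periods if it is
 * currently shorter than that; otherwise leave it unchanged.
 * Hypothetical helper for illustration; not a function in i915.
 */
static unsigned long bumped_preempt_timeout_ms(unsigned long heartbeat_interval_ms,
					       unsigned long preempt_timeout_ms)
{
	unsigned long triple_beat = heartbeat_interval_ms * 3;

	if (triple_beat > preempt_timeout_ms)
		return triple_beat;

	return preempt_timeout_ms;
}
```

For example, a 2500 ms heartbeat interval with a 640 ms pre-emption
timeout would give 7500 ms, while a timeout that already exceeds three
heartbeat periods is left alone. In the driver, the result is still
subject to the GuC cap that follows in intel_engine_setup().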