
i915/guc: Get runtime pm in busyness worker only if already active

Message ID 20230915012857.2159217-1-umesh.nerlige.ramappa@intel.com (mailing list archive)
State New, archived
Series i915/guc: Get runtime pm in busyness worker only if already active

Commit Message

Umesh Nerlige Ramappa Sept. 15, 2023, 1:28 a.m. UTC
Ideally the busyness worker should take a gt pm wakeref, because the
worker only needs to be active while the gt is awake. However, the
gt_park path cancels the worker synchronously, which complicates the
flow if the worker is running at the same time: the cancel waits for
the worker, and when the worker releases its wakeref, that release
would itself call gt_park, leading to a deadlock.

The resolution is to take the global pm wakeref only if runtime pm is
already active. If it is not, there is no need to update the busyness
stats, as they were already updated when the gt was parked.

Note:
- We do not requeue the worker if we cannot take a reference to runtime
  pm, since intel_guc_busyness_unpark would requeue the worker in the
  resume path.

- If the gt was parked for longer than the time it takes the GT
  timestamp to roll over, we ignore those rollovers, since we do not
  care about tracking the exact GT time. We only care about rollovers
  while the gt is active and running workloads.

- There is a window of time between gt_park and runtime suspend, where
  the worker may run. This is acceptable since the worker will not find
  any new data to update busyness.

v2: (Daniele)
- Edit commit message and code comment
- Use runtime pm in the worker
- Put runtime pm after enabling the worker
- Use Link tag and add Fixes tag

v3: (Daniele)
- Reword commit and comments and add details

Link: https://gitlab.freedesktop.org/drm/intel/-/issues/7077
Fixes: 77cdd054dd2c ("drm/i915/pmu: Connect engine busyness stats from GuC to pmu")
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 38 +++++++++++++++++--
 1 file changed, 35 insertions(+), 3 deletions(-)
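
For readers who want the shape of the fix without paging through the
diff below, here is a minimal, self-contained userspace sketch of the
"take a reference only if already active" pattern the worker now
follows. The names (pm_get_if_active, pm_put, busyness_worker) and the
atomic counter are illustrative assumptions, not the i915 API; the real
code uses intel_runtime_pm_get_if_active() as shown in the patch.

	/*
	 * Minimal sketch (assumed names, not the i915 implementation):
	 * a worker that only touches the hardware when a runtime pm
	 * reference is already held. Build with: cc -std=c11 sketch.c
	 */
	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>

	static atomic_int pm_usage;	/* > 0 means the device is runtime-active */

	/* Take a reference only if one is already held; never wakes the device. */
	static bool pm_get_if_active(void)
	{
		int v = atomic_load(&pm_usage);

		while (v > 0) {
			if (atomic_compare_exchange_weak(&pm_usage, &v, v + 1))
				return true;
		}
		return false;	/* device is idle; caller must not touch it */
	}

	static void pm_put(void)
	{
		atomic_fetch_sub(&pm_usage, 1);
	}

	/*
	 * Mirrors the flow of guc_timestamp_ping() after the patch: bail
	 * out when idle instead of taking a blocking wakeref, because a
	 * blocking gt wakeref is exactly what could deadlock against a
	 * park path that cancels this worker synchronously.
	 */
	static void busyness_worker(void)
	{
		if (!pm_get_if_active())
			return;	/* parked: stats were already saved at park time */

		printf("device active: updating busyness stats\n");

		pm_put();
	}

	int main(void)
	{
		busyness_worker();		/* idle: returns without touching HW */
		atomic_store(&pm_usage, 1);	/* pretend a resume made us active */
		busyness_worker();		/* active: stats get updated */
		return 0;
	}

The point of the pattern is that pm_get_if_active() never wakes the
device and never blocks: the worker cannot power the device back up
just to read stats, and it cannot deadlock against a park path that is
synchronously waiting for it to finish.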

Comments

Umesh Nerlige Ramappa Sept. 15, 2023, 7:56 p.m. UTC | #1
On Fri, Sep 15, 2023 at 05:37:53AM +0000, Patchwork wrote:
>   Patch Details
>
>Series:  i915/guc: Get runtime pm in busyness worker only if already active
>URL:     https://patchwork.freedesktop.org/series/123744/
>State:   failure
>Details: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_123744v1/index.html
>
>          CI Bug Log - changes from CI_DRM_13635 -> Patchwork_123744v1
>
>Summary
>
>   FAILURE
>
>   Serious unknown changes coming with Patchwork_123744v1 absolutely need
>   to be verified manually.
>
>   If you think the reported changes have nothing to do with the changes
>   introduced in Patchwork_123744v1, please notify your bug team
>   (lgci.bug.filing@intel.com) to allow them to document this new failure
>   mode, which will reduce false positives in CI.
>
>   External URL:
>   https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_123744v1/index.html
>
>Participating hosts (40 -> 40)
>
>   Additional (1): fi-kbl-soraka
>   Missing (1): fi-snb-2520m
>
>Possible new issues
>
>   Here are the unknown changes that may have been introduced in
>   Patchwork_123744v1:
>
>  IGT changes
>
>    Possible regressions
>
>     * igt@i915_selftest@live@hangcheck:
>
>          * fi-skl-guc: PASS -> DMESG-FAIL

Not related to changes in this patch. I ran the test a couple of times on
MTL and did not see any failures.

>
>     * igt@i915_selftest@live@perf:
>
>          * fi-kbl-soraka: NOTRUN -> ABORT

Unrelated since this does not use GuC.

Umesh

Daniele Ceraolo Spurio Sept. 15, 2023, 11:55 p.m. UTC | #2
On 9/14/2023 6:28 PM, Umesh Nerlige Ramappa wrote:
> Ideally the busyness worker should take a gt pm wakeref because the
> worker only needs to be active while gt is awake. However, the gt_park
> path cancels the worker synchronously and this complicates the flow if
> the worker is also running at the same time. The cancel waits for the
> worker and when the worker releases the wakeref, that would call gt_park
> and would lead to a deadlock.
>
> The resolution is to take the global pm wakeref if runtime pm is already
> active. If not, we don't need to update the busyness stats as the stats
> would already be updated when the gt was parked.
>
> Note:
> - We do not requeue the worker if we cannot take a reference to runtime
>    pm since intel_guc_busyness_unpark would requeue the worker in the
>    resume path.
>
> - If the gt was parked longer than time taken for GT timestamp to roll
>    over, we ignore those rollovers since we don't care about tracking the
>    exact GT time. We only care about roll overs when the gt is active and
>    running workloads.
>
> - There is a window of time between gt_park and runtime suspend, where
>    the worker may run. This is acceptable since the worker will not find
>    any new data to update busyness.
>
> v2: (Daniele)
> - Edit commit message and code comment
> - Use runtime pm in the worker
> - Put runtime pm after enabling the worker
> - Use Link tag and add Fixes tag
>
> v3: (Daniele)
> - Reword commit and comments and add details
>
> Link: https://gitlab.freedesktop.org/drm/intel/-/issues/7077
> Fixes: 77cdd054dd2c ("drm/i915/pmu: Connect engine busyness stats from GuC to pmu")
> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

> [snip]

Patch

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index cabdc645fcdd..ae3495a9c814 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1432,6 +1432,36 @@  static void guc_timestamp_ping(struct work_struct *wrk)
 	unsigned long index;
 	int srcu, ret;
 
+	/*
+	 * Ideally the busyness worker should take a gt pm wakeref because the
+	 * worker only needs to be active while gt is awake. However, the
+	 * gt_park path cancels the worker synchronously and this complicates
+	 * the flow if the worker is also running at the same time. The cancel
+	 * waits for the worker and when the worker releases the wakeref, that
+	 * would call gt_park and would lead to a deadlock.
+	 *
+	 * The resolution is to take the global pm wakeref if runtime pm is
+	 * already active. If not, we don't need to update the busyness stats as
+	 * the stats would already be updated when the gt was parked.
+	 *
+	 * Note:
+	 * - We do not requeue the worker if we cannot take a reference to runtime
+	 *   pm since intel_guc_busyness_unpark would requeue the worker in the
+	 *   resume path.
+	 *
+	 * - If the gt was parked longer than time taken for GT timestamp to roll
+	 *   over, we ignore those rollovers since we don't care about tracking
+	 *   the exact GT time. We only care about roll overs when the gt is
+	 *   active and running workloads.
+	 *
+	 * - There is a window of time between gt_park and runtime suspend,
+	 *   where the worker may run. This is acceptable since the worker will
+	 *   not find any new data to update busyness.
+	 */
+	wakeref = intel_runtime_pm_get_if_active(&gt->i915->runtime_pm);
+	if (!wakeref)
+		return;
+
 	/*
 	 * Synchronize with gt reset to make sure the worker does not
 	 * corrupt the engine/guc stats. NB: can't actually block waiting
@@ -1440,10 +1470,9 @@  static void guc_timestamp_ping(struct work_struct *wrk)
 	 */
 	ret = intel_gt_reset_trylock(gt, &srcu);
 	if (ret)
-		return;
+		goto err_trylock;
 
-	with_intel_runtime_pm(&gt->i915->runtime_pm, wakeref)
-		__update_guc_busyness_stats(guc);
+	__update_guc_busyness_stats(guc);
 
 	/* adjust context stats for overflow */
 	xa_for_each(&guc->context_lookup, index, ce)
@@ -1452,6 +1481,9 @@  static void guc_timestamp_ping(struct work_struct *wrk)
 	intel_gt_reset_unlock(gt, srcu);
 
 	guc_enable_busyness_worker(guc);
+
+err_trylock:
+	intel_runtime_pm_put(&gt->i915->runtime_pm, wakeref);
 }
 
 static int guc_action_enable_usage_stats(struct intel_guc *guc)
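
A closing note on the structure of the error path: once the conditional
wakeref is taken at the top of guc_timestamp_ping(), every later exit
has to release it, which is why a failed reset trylock now jumps to the
err_trylock label instead of returning. Below is a generic sketch of
that goto-unwind shape; the helper names are stand-ins, not driver code.

	#include <stdbool.h>
	#include <stdio.h>

	/* Stand-ins for the wakeref and reset-lock operations (assumed names). */
	static bool acquire_if_active(void) { printf("got wakeref\n"); return true; }
	static void release(void)           { printf("put wakeref\n"); }
	static bool reset_trylock(void)     { return false; /* pretend a reset is running */ }
	static void reset_unlock(void)      { }

	static void worker(void)
	{
		if (!acquire_if_active())
			return;	/* nothing held yet, so a plain return is fine */

		if (!reset_trylock())
			goto err_trylock;	/* wakeref is held and must be released */

		printf("updating stats\n");
		reset_unlock();

	err_trylock:
		release();
	}

	int main(void)
	{
		worker();
		return 0;
	}

As in the patch, the success path also falls through the err_trylock
label, so the release runs on every exit taken after the acquisition.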