drm/i915/pmu: Fix zero delta busyness issue

Message ID	20250123193839.2394694-1-umesh.nerlige.ramappa@intel.com (mailing list archive)
State	New
Headers	show Return-Path: <intel-gfx-bounces@lists.freedesktop.org> From: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com> To: intel-gfx@lists.freedesktop.org Cc: John.C.Harrison@intel.com, rodrigo.vivi@intel.com Subject: [PATCH] drm/i915/pmu: Fix zero delta busyness issue Date: Thu, 23 Jan 2025 11:38:39 -0800 Message-Id: <20250123193839.2394694-1-umesh.nerlige.ramappa@intel.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: list Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>
Series	drm/i915/pmu: Fix zero delta busyness issue \| expand drm/i915/pmu: Fix zero delta busyness issue

Message ID

20250123193839.2394694-1-umesh.nerlige.ramappa@intel.com (mailing list archive)

State

New

Headers

From: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
To: intel-gfx@lists.freedesktop.org
Cc: John.C.Harrison@intel.com,
	rodrigo.vivi@intel.com
Subject: [PATCH] drm/i915/pmu: Fix zero delta busyness issue
Date: Thu, 23 Jan 2025 11:38:39 -0800
Message-Id: <20250123193839.2394694-1-umesh.nerlige.ramappa@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: list
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Series

drm/i915/pmu: Fix zero delta busyness issue | expand

Commit Message

Umesh Nerlige Ramappa Jan. 23, 2025, 7:38 p.m. UTC

When running igt@gem_exec_balancer@individual for multiple iterations,
it is seen that the delta busyness returned by PMU is 0. The issue stems
from a combination of 2 implementation specific details:

1) gt_park is throttling __update_guc_busyness_stats() so that it does
not hog PCI bandwidth for some use cases. (Ref: 59bcdb564b3ba)

2) busyness implementation always returns monotonically increasing
counters. (Ref: cf907f6d29421)

If an application queried an engine while it was active,
engine->stats.guc.running is set to true. Following that, if all PM
wakeref's are released, then gt is parked. At this time the throttling
of __update_guc_busyness_stats() may result in a missed update to the
running state of the engine (due to (1) above). This means subsequent
calls to guc_engine_busyness() will think that the engine is still
running and they will keep updating the cached counter (stats->total).
This results in an inflated cached counter.

Later when the application runs a workload and queries for busyness, we
return the cached value since it is larger than the actual value (due to
(2) above)

All subsequent queries will return the same large (inflated) value, so
the application sees a delta busyness of zero.

Fix the issue by resetting the running state of engines each time
intel_guc_busyness_park() is called.

v2: (Rodrigo)
- Use the correct tag in commit message
- Drop the redundant wakeref check in guc_engine_busyness() and update
  commit message

Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/13366
Fixes: cf907f6d2942 ("i915/guc: Ensure busyness counter increases motonically")
Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c    | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

Comments

Umesh Nerlige Ramappa Jan. 24, 2025, 5:26 p.m. UTC | #1

On Fri, Jan 24, 2025 at 12:45:40PM +0000, Patchwork wrote:
>   Patch Details
>
>Series:  drm/i915/pmu: Fix zero delta busyness issue
>URL:     [1]https://patchwork.freedesktop.org/series/143900/
>State:   failure
>Details: [2]https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_143900v1/index.html
>
>     CI Bug Log - changes from CI_DRM_16011_full -> Patchwork_143900v1_full
>
>Summary
>
>   FAILURE
>
>   Serious unknown changes coming with Patchwork_143900v1_full absolutely
>   need to be
>   verified manually.
>
>   If you think the reported changes have nothing to do with the changes
>   introduced in Patchwork_143900v1_full, please notify your bug team
>   (I915-ci-infra@lists.freedesktop.org) to allow them
>   to document this new failure mode, which will reduce false positives in
>   CI.
>
>Participating hosts (11 -> 12)
>
>   Additional (1): pig-kbl-iris
>
>Possible new issues
>
>   Here are the unknown changes that may have been introduced in
>   Patchwork_143900v1_full:
>
>  IGT changes
>
>    Possible regressions
>
>     * igt@gem_eio@in-flight-internal-immediate:
>
>          * shard-mtlp: [3]PASS -> [4]ABORT
>
>     * igt@gem_workarounds@suspend-resume-context:
>
>          * shard-dg1: [5]PASS -> [6]ABORT
>

These don't appear to be related to this series.

Umesh

Rodrigo Vivi Jan. 24, 2025, 6:06 p.m. UTC | #2

On Thu, Jan 23, 2025 at 11:38:39AM -0800, Umesh Nerlige Ramappa wrote:
> When running igt@gem_exec_balancer@individual for multiple iterations,
> it is seen that the delta busyness returned by PMU is 0. The issue stems
> from a combination of 2 implementation specific details:
> 
> 1) gt_park is throttling __update_guc_busyness_stats() so that it does
> not hog PCI bandwidth for some use cases. (Ref: 59bcdb564b3ba)
> 
> 2) busyness implementation always returns monotonically increasing
> counters. (Ref: cf907f6d29421)
> 
> If an application queried an engine while it was active,
> engine->stats.guc.running is set to true. Following that, if all PM
> wakeref's are released, then gt is parked. At this time the throttling
> of __update_guc_busyness_stats() may result in a missed update to the
> running state of the engine (due to (1) above). This means subsequent
> calls to guc_engine_busyness() will think that the engine is still
> running and they will keep updating the cached counter (stats->total).
> This results in an inflated cached counter.
> 
> Later when the application runs a workload and queries for busyness, we
> return the cached value since it is larger than the actual value (due to
> (2) above)
> 
> All subsequent queries will return the same large (inflated) value, so
> the application sees a delta busyness of zero.
> 
> Fix the issue by resetting the running state of engines each time
> intel_guc_busyness_park() is called.
> 
> v2: (Rodrigo)
> - Use the correct tag in commit message
> - Drop the redundant wakeref check in guc_engine_busyness() and update
>   commit message

Thank you

> 
> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/13366
> Fixes: cf907f6d2942 ("i915/guc: Ensure busyness counter increases motonically")
> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> ---
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c    | 16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 3b1333a24a89..a33b67b83dc1 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1469,6 +1469,19 @@ static void __reset_guc_busyness_stats(struct intel_guc *guc)
>  	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>  }
>  
> +static void __update_guc_busyness_running_state(struct intel_guc *guc)

I hate those '__', but I forgot to spot in the previous email and
I know I know... there's a lot of those in this file already :/

I was thinking of another name for this function as well since
it is only stopping the running state, but I'm bad with naming...

let's move on and close the real issue

Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

> +{
> +	struct intel_gt *gt = guc_to_gt(guc);
> +	struct intel_engine_cs *engine;
> +	enum intel_engine_id id;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
> +	for_each_engine(engine, gt, id)
> +		engine->stats.guc.running = false;
> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
> +}
> +
>  static void __update_guc_busyness_stats(struct intel_guc *guc)
>  {
>  	struct intel_gt *gt = guc_to_gt(guc);
> @@ -1619,6 +1632,9 @@ void intel_guc_busyness_park(struct intel_gt *gt)
>  	if (!guc_submission_initialized(guc))
>  		return;
>  
> +	/* Assume no engines are running and set running state to false */
> +	__update_guc_busyness_running_state(guc);
> +
>  	/*
>  	 * There is a race with suspend flow where the worker runs after suspend
>  	 * and causes an unclaimed register access warning. Cancel the worker
> -- 
> 2.38.1
>

Umesh Nerlige Ramappa Jan. 24, 2025, 6:35 p.m. UTC | #3

On Fri, Jan 24, 2025 at 01:06:08PM -0500, Rodrigo Vivi wrote:
>On Thu, Jan 23, 2025 at 11:38:39AM -0800, Umesh Nerlige Ramappa wrote:
>> When running igt@gem_exec_balancer@individual for multiple iterations,
>> it is seen that the delta busyness returned by PMU is 0. The issue stems
>> from a combination of 2 implementation specific details:
>>
>> 1) gt_park is throttling __update_guc_busyness_stats() so that it does
>> not hog PCI bandwidth for some use cases. (Ref: 59bcdb564b3ba)
>>
>> 2) busyness implementation always returns monotonically increasing
>> counters. (Ref: cf907f6d29421)
>>
>> If an application queried an engine while it was active,
>> engine->stats.guc.running is set to true. Following that, if all PM
>> wakeref's are released, then gt is parked. At this time the throttling
>> of __update_guc_busyness_stats() may result in a missed update to the
>> running state of the engine (due to (1) above). This means subsequent
>> calls to guc_engine_busyness() will think that the engine is still
>> running and they will keep updating the cached counter (stats->total).
>> This results in an inflated cached counter.
>>
>> Later when the application runs a workload and queries for busyness, we
>> return the cached value since it is larger than the actual value (due to
>> (2) above)
>>
>> All subsequent queries will return the same large (inflated) value, so
>> the application sees a delta busyness of zero.
>>
>> Fix the issue by resetting the running state of engines each time
>> intel_guc_busyness_park() is called.
>>
>> v2: (Rodrigo)
>> - Use the correct tag in commit message
>> - Drop the redundant wakeref check in guc_engine_busyness() and update
>>   commit message
>
>Thank you
>
>>
>> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/13366
>> Fixes: cf907f6d2942 ("i915/guc: Ensure busyness counter increases motonically")
>> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>> ---
>>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c    | 16 ++++++++++++++++
>>  1 file changed, 16 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> index 3b1333a24a89..a33b67b83dc1 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> @@ -1469,6 +1469,19 @@ static void __reset_guc_busyness_stats(struct intel_guc *guc)
>>  	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>>  }
>>
>> +static void __update_guc_busyness_running_state(struct intel_guc *guc)
>
>I hate those '__', but I forgot to spot in the previous email and
>I know I know... there's a lot of those in this file already :/

Agree, I added it this time to keep the names consistent with existing 
mess (which was also created by me!). Otherwise, as per other reviewers' 
comments, I don't use them.

Some tips on when to use '__' would help.

>
>I was thinking of another name for this function as well since
>it is only stopping the running state, but I'm bad with naming...
>
>let's move on and close the real issue
>
>Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

Thanks,
Umesh
>
>> +{
>> +	struct intel_gt *gt = guc_to_gt(guc);
>> +	struct intel_engine_cs *engine;
>> +	enum intel_engine_id id;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&guc->timestamp.lock, flags);
>> +	for_each_engine(engine, gt, id)
>> +		engine->stats.guc.running = false;
>> +	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
>> +}
>> +
>>  static void __update_guc_busyness_stats(struct intel_guc *guc)
>>  {
>>  	struct intel_gt *gt = guc_to_gt(guc);
>> @@ -1619,6 +1632,9 @@ void intel_guc_busyness_park(struct intel_gt *gt)
>>  	if (!guc_submission_initialized(guc))
>>  		return;
>>
>> +	/* Assume no engines are running and set running state to false */
>> +	__update_guc_busyness_running_state(guc);
>> +
>>  	/*
>>  	 * There is a race with suspend flow where the worker runs after suspend
>>  	 * and causes an unclaimed register access warning. Cancel the worker
>> --
>> 2.38.1
>>

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 3b1333a24a89..a33b67b83dc1 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1469,6 +1469,19 @@  static void __reset_guc_busyness_stats(struct intel_guc *guc)
 	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
 }
 
+static void __update_guc_busyness_running_state(struct intel_guc *guc)
+{
+	struct intel_gt *gt = guc_to_gt(guc);
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	unsigned long flags;
+
+	spin_lock_irqsave(&guc->timestamp.lock, flags);
+	for_each_engine(engine, gt, id)
+		engine->stats.guc.running = false;
+	spin_unlock_irqrestore(&guc->timestamp.lock, flags);
+}
+
 static void __update_guc_busyness_stats(struct intel_guc *guc)
 {
 	struct intel_gt *gt = guc_to_gt(guc);
@@ -1619,6 +1632,9 @@  void intel_guc_busyness_park(struct intel_gt *gt)
 	if (!guc_submission_initialized(guc))
 		return;
 
+	/* Assume no engines are running and set running state to false */
+	__update_guc_busyness_running_state(guc);
+
 	/*
 	 * There is a race with suspend flow where the worker runs after suspend
 	 * and causes an unclaimed register access warning. Cancel the worker