
[v2,6/6] drm/xe/client: Print runtime to fdinfo

Message ID 20240423235652.1959945-7-lucas.demarchi@intel.com (mailing list archive)
State New, archived
Series drm/xe: Per client usage

Commit Message

Lucas De Marchi April 23, 2024, 11:56 p.m. UTC
Print the accumulated runtime for the client when printing fdinfo.
Each time a query is done, it first does 2 things:

1) loop through all the exec queues for the current client and
   accumulate the runtime, per engine class. CTX_TIMESTAMP is used for
   that, being read from the context image.

2) Read a "GPU timestamp" that can be used for considering "how much GPU
   time has passed" and that has the same unit/refclock as the one
   recording the runtime. RING_TIMESTAMP is used for that via MMIO.

Since for all current platforms RING_TIMESTAMP follows the same
refclock, just read it once, using any first engine.

This is exported to userspace as 2 numbers in fdinfo:

	drm-cycles-<class>: <RUNTIME>
	drm-total-cycles-<class>: <TIMESTAMP>

Userspace is expected to collect at least 2 samples, which allows it to
compute the client engine busyness as per:

		    RUNTIME1 - RUNTIME0
	busyness = ---------------------
			  T1 - T0

Another thing to point out is that userspace is expected to read any 2
samples only a few seconds apart.  Given the update frequency of the
counters involved and that CTX_TIMESTAMP is 32 bits, the counter for
each exec_queue can wrap around (assuming 100% utilization) after ~200s.
The wraparound is not perceived by userspace since the value is
accumulated for all the exec_queues in a 64-bit counter, but the
measurement will not be accurate if the samples are too far apart.
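
For illustration only, a minimal userspace sketch of the sampling and the
formula above (assumptions: the fdinfo parsing is simplified, only the rcs
class is read, and the fdinfo path, e.g. /proc/<pid>/fdinfo/<fd>, is passed
as the first argument):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

struct sample {
	uint64_t cycles;	/* drm-cycles-rcs */
	uint64_t total;		/* drm-total-cycles-rcs */
};

/* Parse one fdinfo snapshot, keeping only the rcs keys used above */
static int read_sample(const char *path, struct sample *s)
{
	char line[256];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		sscanf(line, "drm-cycles-rcs: %" SCNu64, &s->cycles);
		sscanf(line, "drm-total-cycles-rcs: %" SCNu64, &s->total);
	}

	fclose(f);
	return 0;
}

int main(int argc, char **argv)
{
	struct sample s0 = { 0 }, s1 = { 0 };

	if (argc < 2 || read_sample(argv[1], &s0))
		return 1;

	sleep(2);	/* sample period of a couple of seconds */

	if (read_sample(argv[1], &s1) || s1.total == s0.total)
		return 1;

	/* busyness = (RUNTIME1 - RUNTIME0) / (T1 - T0) */
	printf("rcs: %.1f%% busy\n",
	       100.0 * (s1.cycles - s0.cycles) / (s1.total - s0.total));

	return 0;
}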

This could be mitigated by adding a workqueue to accumulate the counters
every so often, but it's additional complexity for something that is
done already by userspace every few seconds in tools like gputop (from
igt), htop, nvtop, etc with none of them really defaulting to 1 sample
per minute or more.

Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
---
 Documentation/gpu/drm-usage-stats.rst       |  16 ++-
 Documentation/gpu/xe/index.rst              |   1 +
 Documentation/gpu/xe/xe-drm-usage-stats.rst |  10 ++
 drivers/gpu/drm/xe/xe_drm_client.c          | 138 +++++++++++++++++++-
 4 files changed, 162 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst

Comments

Tvrtko Ursulin April 26, 2024, 10:47 a.m. UTC | #1
On 24/04/2024 00:56, Lucas De Marchi wrote:
> Print the accumulated runtime for client when printing fdinfo.
> Each time a query is done it first does 2 things:
> 
> 1) loop through all the exec queues for the current client and
>     accumulate the runtime, per engine class. CTX_TIMESTAMP is used for
>     that, being read from the context image.
> 
> 2) Read a "GPU timestamp" that can be used for considering "how much GPU
>     time has passed" and that has the same unit/refclock as the one
>     recording the runtime. RING_TIMESTAMP is used for that via MMIO.
> 
> Since for all current platforms RING_TIMESTAMP follows the same
> refclock, just read it once, using any first engine.
> 
> This is exported to userspace as 2 numbers in fdinfo:
> 
> 	drm-cycles-<class>: <RUNTIME>
> 	drm-total-cycles-<class>: <TIMESTAMP>
> 
> Userspace is expected to collect at least 2 samples, which allows to
> know the client engine busyness as per:
> 
> 		    RUNTIME1 - RUNTIME0
> 	busyness = ---------------------
> 			  T1 - T0
> 
> Another thing to point out is that it's expected that userspace will
> read any 2 samples every few seconds.  Given the update frequency of the
> counters involved and that CTX_TIMESTAMP is 32-bits, the counter for
> each exec_queue can wrap around (assuming 100% utilization) after ~200s.
> The wraparound is not perceived by userspace since it's just accumulated
> for all the exec_queues in a 64-bit counter), but the measurement will
> not be accurate if the samples are too far apart.
> 
> This could be mitigated by adding a workqueue to accumulate the counters
> every so often, but it's additional complexity for something that is
> done already by userspace every few seconds in tools like gputop (from
> igt), htop, nvtop, etc with none of them really defaulting to 1 sample
> per minute or more.
> 
> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
> ---
>   Documentation/gpu/drm-usage-stats.rst       |  16 ++-
>   Documentation/gpu/xe/index.rst              |   1 +
>   Documentation/gpu/xe/xe-drm-usage-stats.rst |  10 ++
>   drivers/gpu/drm/xe/xe_drm_client.c          | 138 +++++++++++++++++++-
>   4 files changed, 162 insertions(+), 3 deletions(-)
>   create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst
> 
> diff --git a/Documentation/gpu/drm-usage-stats.rst b/Documentation/gpu/drm-usage-stats.rst
> index 6dc299343b48..421766289b78 100644
> --- a/Documentation/gpu/drm-usage-stats.rst
> +++ b/Documentation/gpu/drm-usage-stats.rst
> @@ -112,6 +112,17 @@ larger value within a reasonable period. Upon observing a value lower than what
>   was previously read, userspace is expected to stay with that larger previous
>   value until a monotonic update is seen.
>   
> +- drm-total-cycles-<keystr>: <uint>
> +
> +Engine identifier string must be the same as the one specified in the
> +drm-cycles-<keystr> tag and shall contain the total number cycles for the given
> +engine.
> +
> +This is a timestamp in GPU unspecified unit that matches the update rate
> +of drm-cycles-<keystr>. For drivers that implement this interface, the engine
> +utilization can be calculated entirely on the GPU clock domain, without
> +considering the CPU sleep time between 2 samples.

Two opens.

1)
Do we need to explicitly document that drm-total-cycles and drm-maxfreq 
are mutually exclusive?

2)
Should drm-total-cycles for now be documented as driver specific?

I have added some more people in the cc who were involved with driver 
fdinfo implementations in case they have an opinion.

I would say potentially yes, and promote it to common if more than one 
driver would use it.

For instance I see panfrost has the driver specific drm-curfreq 
(although it isn't documented fully in panfrost.rst). And I have to say 
it is somewhat questionable to expose the current frequency per fdinfo 
per engine, but that's not my call.

> +
>   - drm-maxfreq-<keystr>: <uint> [Hz|MHz|KHz]
>   
>   Engine identifier string must be the same as the one specified in the
> @@ -168,5 +179,6 @@ be documented above and where possible, aligned with other drivers.
>   Driver specific implementations
>   -------------------------------
>   
> -:ref:`i915-usage-stats`
> -:ref:`panfrost-usage-stats`
> +* :ref:`i915-usage-stats`
> +* :ref:`panfrost-usage-stats`
> +* :ref:`xe-usage-stats`
> diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
> index c224ecaee81e..3f07aa3b5432 100644
> --- a/Documentation/gpu/xe/index.rst
> +++ b/Documentation/gpu/xe/index.rst
> @@ -23,3 +23,4 @@ DG2, etc is provided to prototype the driver.
>      xe_firmware
>      xe_tile
>      xe_debugging
> +   xe-drm-usage-stats.rst
> diff --git a/Documentation/gpu/xe/xe-drm-usage-stats.rst b/Documentation/gpu/xe/xe-drm-usage-stats.rst
> new file mode 100644
> index 000000000000..ccb48733cbe3
> --- /dev/null
> +++ b/Documentation/gpu/xe/xe-drm-usage-stats.rst
> @@ -0,0 +1,10 @@
> +.. SPDX-License-Identifier: GPL-2.0+
> +
> +.. _xe-usage-stats:
> +
> +=======================================
> +Xe DRM client usage stats implemenation
> +=======================================
> +
> +.. kernel-doc:: drivers/gpu/drm/xe/xe_drm_client.c
> +   :doc: DRM Client usage stats
> diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
> index 08f0b7c95901..0227383910fa 100644
> --- a/drivers/gpu/drm/xe/xe_drm_client.c
> +++ b/drivers/gpu/drm/xe/xe_drm_client.c
> @@ -2,6 +2,7 @@
>   /*
>    * Copyright © 2023 Intel Corporation
>    */
> +#include "xe_drm_client.h"
>   
>   #include <drm/drm_print.h>
>   #include <drm/xe_drm.h>
> @@ -12,9 +13,66 @@
>   #include "xe_bo.h"
>   #include "xe_bo_types.h"
>   #include "xe_device_types.h"
> -#include "xe_drm_client.h"
> +#include "xe_exec_queue.h"
> +#include "xe_gt.h"
> +#include "xe_hw_engine.h"
> +#include "xe_pm.h"
>   #include "xe_trace.h"
>   
> +/**
> + * DOC: DRM Client usage stats
> + *
> + * The drm/xe driver implements the DRM client usage stats specification as
> + * documented in :ref:`drm-client-usage-stats`.
> + *
> + * Example of the output showing the implemented key value pairs and entirety of
> + * the currently possible format options:
> + *
> + * ::
> + *
> + * 	pos:    0
> + * 	flags:  0100002
> + * 	mnt_id: 26
> + * 	ino:    685
> + * 	drm-driver:     xe
> + * 	drm-client-id:  3
> + * 	drm-pdev:       0000:03:00.0
> + * 	drm-total-system:       0
> + * 	drm-shared-system:      0
> + * 	drm-active-system:      0
> + * 	drm-resident-system:    0
> + * 	drm-purgeable-system:   0
> + * 	drm-total-gtt:  192 KiB
> + * 	drm-shared-gtt: 0
> + * 	drm-active-gtt: 0
> + * 	drm-resident-gtt:       192 KiB
> + * 	drm-total-vram0:        23992 KiB
> + * 	drm-shared-vram0:       16 MiB
> + * 	drm-active-vram0:       0
> + * 	drm-resident-vram0:     23992 KiB
> + * 	drm-total-stolen:       0
> + * 	drm-shared-stolen:      0
> + * 	drm-active-stolen:      0
> + * 	drm-resident-stolen:    0
> + * 	drm-cycles-rcs: 28257900
> + * 	drm-total-cycles-rcs:   7655183225
> + * 	drm-cycles-bcs: 0
> + * 	drm-total-cycles-bcs:   7655183225
> + * 	drm-cycles-vcs: 0
> + * 	drm-total-cycles-vcs:   7655183225
> + * 	drm-engine-capacity-vcs:        2
> + * 	drm-cycles-vecs:        0
> + * 	drm-total-cycles-vecs:  7655183225
> + * 	drm-engine-capacity-vecs:       2
> + * 	drm-cycles-ccs: 0
> + * 	drm-total-cycles-ccs:   7655183225
> + * 	drm-engine-capacity-ccs:        4
> + *
> + * Possible `drm-cycles-` key names are: `rcs`, `ccs`, `bcs`, `vcs`, `vecs` and
> + * "other".
> + */
> +
> +
>   /**
>    * xe_drm_client_alloc() - Allocate drm client
>    * @void: No arg
> @@ -179,6 +237,83 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file)
>   	}
>   }
>   
> +static const u64 class_to_mask[] = {
> +        [XE_ENGINE_CLASS_RENDER] = XE_HW_ENGINE_RCS_MASK,
> +        [XE_ENGINE_CLASS_VIDEO_DECODE] = XE_HW_ENGINE_VCS_MASK,
> +        [XE_ENGINE_CLASS_VIDEO_ENHANCE] = XE_HW_ENGINE_VECS_MASK,
> +        [XE_ENGINE_CLASS_COPY] = XE_HW_ENGINE_BCS_MASK,
> +        [XE_ENGINE_CLASS_OTHER] = XE_HW_ENGINE_GSCCS_MASK,
> +        [XE_ENGINE_CLASS_COMPUTE] = XE_HW_ENGINE_CCS_MASK,
> +};
> +
> +static void show_runtime(struct drm_printer *p, struct drm_file *file)
> +{
> +	struct xe_file *xef = file->driver_priv;
> +	struct xe_device *xe = xef->xe;
> +	struct xe_gt *gt;
> +	struct xe_hw_engine *hwe;
> +	struct xe_exec_queue *q;
> +	unsigned long i, id_hwe, id_gt, capacity[XE_ENGINE_CLASS_MAX] = { };
> +	u64 gpu_timestamp, engine_mask = 0;
> +	bool gpu_stamp = false;
> +
> +	xe_pm_runtime_get(xe);
> +
> +	/* Accumulate all the exec queues from this client */
> +	mutex_lock(&xef->exec_queue.lock);
> +	xa_for_each(&xef->exec_queue.xa, i, q)
> +		xe_exec_queue_update_runtime(q);
> +	mutex_unlock(&xef->exec_queue.lock);
> +
> +
> +	/* Calculate capacity of each engine class */
> +	BUILD_BUG_ON(ARRAY_SIZE(class_to_mask) != XE_ENGINE_CLASS_MAX);
> +	for_each_gt(gt, xe, id_gt)
> +		engine_mask |= gt->info.engine_mask;
> +	for (i = 0; i < XE_ENGINE_CLASS_MAX; i++)
> +		capacity[i] = hweight64(engine_mask & class_to_mask[i]);

FWIW the above two loops are static so could store capacity in struct 
xe_device.
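
For completeness, a rough sketch of that caching idea (assumptions: the
xe->class_capacity[] array and the init hook below are hypothetical and not
part of the current driver; class_to_mask and the iterators are the ones
from the patch):

/*
 * Hypothetical sketch only: compute the per-class capacity once at
 * device init and store it in a new xe->class_capacity[] array, so
 * show_runtime() does not have to walk the GTs on every fdinfo read.
 */
static void xe_init_class_capacity(struct xe_device *xe)
{
	struct xe_gt *gt;
	unsigned long id_gt;
	u64 engine_mask = 0;
	int i;

	for_each_gt(gt, xe, id_gt)
		engine_mask |= gt->info.engine_mask;

	for (i = 0; i < XE_ENGINE_CLASS_MAX; i++)
		xe->class_capacity[i] = hweight64(engine_mask & class_to_mask[i]);
}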

> +
> +	/*
> +	 * Iterate over all engines, printing the accumulated
> +	 * runtime for this client, per engine class
> +	 */
> +	for_each_gt(gt, xe, id_gt) {
> +		xe_force_wake_get(gt_to_fw(gt), XE_FW_GT);
> +		for_each_hw_engine(hwe, gt, id_hwe) {
> +			const char *class_name;
> +
> +			if (!capacity[hwe->class])
> +				continue;
> +
> +			/*
> +			 * Use any (first) engine to have a timestamp to be used every
> +			 * time
> +			 */
> +			if (!gpu_stamp) {
> +				gpu_timestamp = xe_hw_engine_read_timestamp(hwe);
> +				gpu_stamp = true;
> +			}
> +
> +			class_name = xe_hw_engine_class_to_str(hwe->class);
> +
> +			drm_printf(p, "drm-cycles-%s:\t%llu\n",
> +				   class_name, xef->runtime[hwe->class]);
> +			drm_printf(p, "drm-total-cycles-%s:\t%llu\n",
> +				   class_name, gpu_timestamp);
> +
> +			if (capacity[hwe->class] > 1)
> +				drm_printf(p, "drm-engine-capacity-%s:\t%lu\n",
> +					   class_name, capacity[hwe->class]);
> +
> +			/* engine class already handled, skip next iterations */
> +			capacity[hwe->class] = 0;
> +		}
> +		xe_force_wake_put(gt_to_fw(gt), XE_FW_GT);
> +	}

More FWIW and AFAICT, you could just walk the "list" of classes instead of 
the nested loop that skips already visited classes. Assuming in xe there 
is an easy way to get a known present engine for the gpu_timestamp, it 
would be flatter and less code.

Regards,

Tvrtko

> +
> +	xe_pm_runtime_get(xe);
> +}
> +
>   /**
>    * xe_drm_client_fdinfo() - Callback for fdinfo interface
>    * @p: The drm_printer ptr
> @@ -192,5 +327,6 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file)
>   void xe_drm_client_fdinfo(struct drm_printer *p, struct drm_file *file)
>   {
>   	show_meminfo(p, file);
> +	show_runtime(p, file);
>   }
>   #endif
Lucas De Marchi May 7, 2024, 9:35 p.m. UTC | #2
On Fri, Apr 26, 2024 at 11:47:37AM GMT, Tvrtko Ursulin wrote:
>
>On 24/04/2024 00:56, Lucas De Marchi wrote:
>>Print the accumulated runtime for client when printing fdinfo.
>>Each time a query is done it first does 2 things:
>>
>>1) loop through all the exec queues for the current client and
>>    accumulate the runtime, per engine class. CTX_TIMESTAMP is used for
>>    that, being read from the context image.
>>
>>2) Read a "GPU timestamp" that can be used for considering "how much GPU
>>    time has passed" and that has the same unit/refclock as the one
>>    recording the runtime. RING_TIMESTAMP is used for that via MMIO.
>>
>>Since for all current platforms RING_TIMESTAMP follows the same
>>refclock, just read it once, using any first engine.
>>
>>This is exported to userspace as 2 numbers in fdinfo:
>>
>>	drm-cycles-<class>: <RUNTIME>
>>	drm-total-cycles-<class>: <TIMESTAMP>
>>
>>Userspace is expected to collect at least 2 samples, which allows to
>>know the client engine busyness as per:
>>
>>		    RUNTIME1 - RUNTIME0
>>	busyness = ---------------------
>>			  T1 - T0
>>
>>Another thing to point out is that it's expected that userspace will
>>read any 2 samples every few seconds.  Given the update frequency of the
>>counters involved and that CTX_TIMESTAMP is 32-bits, the counter for
>>each exec_queue can wrap around (assuming 100% utilization) after ~200s.
>>The wraparound is not perceived by userspace since it's just accumulated
>>for all the exec_queues in a 64-bit counter), but the measurement will
>>not be accurate if the samples are too far apart.
>>
>>This could be mitigated by adding a workqueue to accumulate the counters
>>every so often, but it's additional complexity for something that is
>>done already by userspace every few seconds in tools like gputop (from
>>igt), htop, nvtop, etc with none of them really defaulting to 1 sample
>>per minute or more.
>>
>>Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
>>---
>>  Documentation/gpu/drm-usage-stats.rst       |  16 ++-
>>  Documentation/gpu/xe/index.rst              |   1 +
>>  Documentation/gpu/xe/xe-drm-usage-stats.rst |  10 ++
>>  drivers/gpu/drm/xe/xe_drm_client.c          | 138 +++++++++++++++++++-
>>  4 files changed, 162 insertions(+), 3 deletions(-)
>>  create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst
>>
>>diff --git a/Documentation/gpu/drm-usage-stats.rst b/Documentation/gpu/drm-usage-stats.rst
>>index 6dc299343b48..421766289b78 100644
>>--- a/Documentation/gpu/drm-usage-stats.rst
>>+++ b/Documentation/gpu/drm-usage-stats.rst
>>@@ -112,6 +112,17 @@ larger value within a reasonable period. Upon observing a value lower than what
>>  was previously read, userspace is expected to stay with that larger previous
>>  value until a monotonic update is seen.
>>+- drm-total-cycles-<keystr>: <uint>
>>+
>>+Engine identifier string must be the same as the one specified in the
>>+drm-cycles-<keystr> tag and shall contain the total number cycles for the given
>>+engine.
>>+
>>+This is a timestamp in GPU unspecified unit that matches the update rate
>>+of drm-cycles-<keystr>. For drivers that implement this interface, the engine
>>+utilization can be calculated entirely on the GPU clock domain, without
>>+considering the CPU sleep time between 2 samples.
>
>Two opens.
>
>1)
>Do we need to explicity document that drm-total-cycles and drm-maxfreq 
>are mutually exclusive?

so userspace has a fallback mechanism to calculate utilization depending
on what keys are available?

>
>2)
>Should drm-total-cycles for now be documents as driver specific?

you mean to call it xe-total-cycles?

>
>I have added some more poeple in the cc who were involved with driver 
>fdinfo implementations if they will have an opinion.
>
>I would say potentially yes, and promote it to common if more than one 
>driver would use it.
>
>For instance I see panfrost has the driver specific drm-curfreq 
>(although isn't documenting it fully in panfrost.rst). And I have to 
>say it is somewhat questionable to expose the current frequency per 
>fdinfo per engine but not my call.

aren't all the keys in Documentation/gpu/drm-usage-stats.rst optional,
so a driver may or may not implement them? When you say driver-specific
I'd think more of the ones not using drm- as prefix, e.g. amd-*.

I think drm-cycles + drm-total-cycles is just an alternative
implementation for engine utilization. Like drm-cycles + drm-maxfreq
already is an alternative to drm-engine and is not implemented by e.g.
amdgpu/i915.

I will submit a new version of the entire patch series to get the ball
rolling, but let's keep this open for now.

<...>

>>+static void show_runtime(struct drm_printer *p, struct drm_file *file)
>>+{
>>+	struct xe_file *xef = file->driver_priv;
>>+	struct xe_device *xe = xef->xe;
>>+	struct xe_gt *gt;
>>+	struct xe_hw_engine *hwe;
>>+	struct xe_exec_queue *q;
>>+	unsigned long i, id_hwe, id_gt, capacity[XE_ENGINE_CLASS_MAX] = { };
>>+	u64 gpu_timestamp, engine_mask = 0;
>>+	bool gpu_stamp = false;
>>+
>>+	xe_pm_runtime_get(xe);
>>+
>>+	/* Accumulate all the exec queues from this client */
>>+	mutex_lock(&xef->exec_queue.lock);
>>+	xa_for_each(&xef->exec_queue.xa, i, q)
>>+		xe_exec_queue_update_runtime(q);
>>+	mutex_unlock(&xef->exec_queue.lock);
>>+
>>+
>>+	/* Calculate capacity of each engine class */
>>+	BUILD_BUG_ON(ARRAY_SIZE(class_to_mask) != XE_ENGINE_CLASS_MAX);
>>+	for_each_gt(gt, xe, id_gt)
>>+		engine_mask |= gt->info.engine_mask;
>>+	for (i = 0; i < XE_ENGINE_CLASS_MAX; i++)
>>+		capacity[i] = hweight64(engine_mask & class_to_mask[i]);
>
>FWIW the above two loops are static so could store capacity in struct 
>xe_device.

yes, but just creating a cache in xe of something derived from the gt is
not something to consider lightly, particularly considering the small
xe->info.gt_count we have. For something that runs only when someone
cats the fdinfo, this doesn't seem terrible.

>
>>+
>>+	/*
>>+	 * Iterate over all engines, printing the accumulated
>>+	 * runtime for this client, per engine class
>>+	 */
>>+	for_each_gt(gt, xe, id_gt) {
>>+		xe_force_wake_get(gt_to_fw(gt), XE_FW_GT);
>>+		for_each_hw_engine(hwe, gt, id_hwe) {
>>+			const char *class_name;
>>+
>>+			if (!capacity[hwe->class])
>>+				continue;
>>+
>>+			/*
>>+			 * Use any (first) engine to have a timestamp to be used every
>>+			 * time
>>+			 */
>>+			if (!gpu_stamp) {
>>+				gpu_timestamp = xe_hw_engine_read_timestamp(hwe);
>>+				gpu_stamp = true;
>>+			}
>>+
>>+			class_name = xe_hw_engine_class_to_str(hwe->class);
>>+
>>+			drm_printf(p, "drm-cycles-%s:\t%llu\n",
>>+				   class_name, xef->runtime[hwe->class]);
>>+			drm_printf(p, "drm-total-cycles-%s:\t%llu\n",
>>+				   class_name, gpu_timestamp);
>>+
>>+			if (capacity[hwe->class] > 1)
>>+				drm_printf(p, "drm-engine-capacity-%s:\t%lu\n",
>>+					   class_name, capacity[hwe->class]);
>>+
>>+			/* engine class already handled, skip next iterations */
>>+			capacity[hwe->class] = 0;
>>+		}
>>+		xe_force_wake_put(gt_to_fw(gt), XE_FW_GT);
>>+	}
>
>More FWIW and AFAICT, could just walk the "list" of classes instead of 

xe_force_wake_get() is per gt, so the alternative would be... loop
through the gts to get all forcewakes, loop through all engine classes, loop
again through all gts to put the forcewake. And we also need to consider
that an engine class may not be available in all GTs... example:
vcs/vecs in MTL and later, so we need to track it globally across GTs
anyway.

>the nested loop with skipping already visited classes. Assuming in xe 
>there is an easy way to get a known present engine for the 
>gpu_timestamp it would be flatter and less code.

from the device we can get either tile or gt. To work with the
hw_engines we have to look inside the gt.

Lucas De Marchi

>
>Regards,
>
>Tvrtko
>
>>+
>>+	xe_pm_runtime_get(xe);
>>+}
>>+
>>  /**
>>   * xe_drm_client_fdinfo() - Callback for fdinfo interface
>>   * @p: The drm_printer ptr
>>@@ -192,5 +327,6 @@ static void show_meminfo(struct drm_printer *p, struct drm_file *file)
>>  void xe_drm_client_fdinfo(struct drm_printer *p, struct drm_file *file)
>>  {
>>  	show_meminfo(p, file);
>>+	show_runtime(p, file);
>>  }
>>  #endif
Tvrtko Ursulin May 8, 2024, 8:23 a.m. UTC | #3
On 07/05/2024 22:35, Lucas De Marchi wrote:
> On Fri, Apr 26, 2024 at 11:47:37AM GMT, Tvrtko Ursulin wrote:
>>
>> On 24/04/2024 00:56, Lucas De Marchi wrote:
>>> Print the accumulated runtime for client when printing fdinfo.
>>> Each time a query is done it first does 2 things:
>>>
>>> 1) loop through all the exec queues for the current client and
>>>    accumulate the runtime, per engine class. CTX_TIMESTAMP is used for
>>>    that, being read from the context image.
>>>
>>> 2) Read a "GPU timestamp" that can be used for considering "how much GPU
>>>    time has passed" and that has the same unit/refclock as the one
>>>    recording the runtime. RING_TIMESTAMP is used for that via MMIO.
>>>
>>> Since for all current platforms RING_TIMESTAMP follows the same
>>> refclock, just read it once, using any first engine.
>>>
>>> This is exported to userspace as 2 numbers in fdinfo:
>>>
>>>     drm-cycles-<class>: <RUNTIME>
>>>     drm-total-cycles-<class>: <TIMESTAMP>
>>>
>>> Userspace is expected to collect at least 2 samples, which allows to
>>> know the client engine busyness as per:
>>>
>>>             RUNTIME1 - RUNTIME0
>>>     busyness = ---------------------
>>>               T1 - T0
>>>
>>> Another thing to point out is that it's expected that userspace will
>>> read any 2 samples every few seconds.  Given the update frequency of the
>>> counters involved and that CTX_TIMESTAMP is 32-bits, the counter for
>>> each exec_queue can wrap around (assuming 100% utilization) after ~200s.
>>> The wraparound is not perceived by userspace since it's just accumulated
>>> for all the exec_queues in a 64-bit counter), but the measurement will
>>> not be accurate if the samples are too far apart.
>>>
>>> This could be mitigated by adding a workqueue to accumulate the counters
>>> every so often, but it's additional complexity for something that is
>>> done already by userspace every few seconds in tools like gputop (from
>>> igt), htop, nvtop, etc with none of them really defaulting to 1 sample
>>> per minute or more.
>>>
>>> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
>>> ---
>>>  Documentation/gpu/drm-usage-stats.rst       |  16 ++-
>>>  Documentation/gpu/xe/index.rst              |   1 +
>>>  Documentation/gpu/xe/xe-drm-usage-stats.rst |  10 ++
>>>  drivers/gpu/drm/xe/xe_drm_client.c          | 138 +++++++++++++++++++-
>>>  4 files changed, 162 insertions(+), 3 deletions(-)
>>>  create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst
>>>
>>> diff --git a/Documentation/gpu/drm-usage-stats.rst 
>>> b/Documentation/gpu/drm-usage-stats.rst
>>> index 6dc299343b48..421766289b78 100644
>>> --- a/Documentation/gpu/drm-usage-stats.rst
>>> +++ b/Documentation/gpu/drm-usage-stats.rst
>>> @@ -112,6 +112,17 @@ larger value within a reasonable period. Upon 
>>> observing a value lower than what
>>>  was previously read, userspace is expected to stay with that larger 
>>> previous
>>>  value until a monotonic update is seen.
>>> +- drm-total-cycles-<keystr>: <uint>
>>> +
>>> +Engine identifier string must be the same as the one specified in the
>>> +drm-cycles-<keystr> tag and shall contain the total number cycles 
>>> for the given
>>> +engine.
>>> +
>>> +This is a timestamp in GPU unspecified unit that matches the update 
>>> rate
>>> +of drm-cycles-<keystr>. For drivers that implement this interface, 
>>> the engine
>>> +utilization can be calculated entirely on the GPU clock domain, without
>>> +considering the CPU sleep time between 2 samples.
>>
>> Two opens.
>>
>> 1)
>> Do we need to explicity document that drm-total-cycles and drm-maxfreq 
>> are mutually exclusive?
> 
> so userspace has a fallback mechanism to calculate utilization depending
> on what keys are available?

No, documenting all three at once does not make sense. Or at least it is 
not expected. Or do you envisage someone might legitimately emit all 
three? I don't see what the semantics would be. When we have 
cycles+maxfreq the latter is in Hz, and when we have cycles+total it is 
unitless. All three?

>> 2)
>> Should drm-total-cycles for now be documents as driver specific?
> 
> you mean to call it xe-total-cycles?

Yes but it is not an ask, just an open.

>> I have added some more poeple in the cc who were involved with driver 
>> fdinfo implementations if they will have an opinion.
>>
>> I would say potentially yes, and promote it to common if more than one 
>> driver would use it.
>>
>> For instance I see panfrost has the driver specific drm-curfreq 
>> (although isn't documenting it fully in panfrost.rst). And I have to 
>> say it is somewhat questionable to expose the current frequency per 
>> fdinfo per engine but not my call.
> 
> aren't all of Documentation/gpu/drm-usage-stats.rst optional that
> driver may or may not implement? When you say driver-specific I'd think
> more of the ones not using <drm> as prefix as e.g. amd-*.
> 
> I think drm-cycles + drm-total-cycles is just an alternative
> implementation for engine utilization. Like drm-cycles + drm-maxfreq
> already is an alternative to drm-engine and is not implemented by e.g.
> amdgpu/i915.
> 
> I will submit a new version of the entire patch series to get the ball
> rolling, but let's keep this open for now.
> 
> <...>
> 
>>> +static void show_runtime(struct drm_printer *p, struct drm_file *file)
>>> +{
>>> +    struct xe_file *xef = file->driver_priv;
>>> +    struct xe_device *xe = xef->xe;
>>> +    struct xe_gt *gt;
>>> +    struct xe_hw_engine *hwe;
>>> +    struct xe_exec_queue *q;
>>> +    unsigned long i, id_hwe, id_gt, capacity[XE_ENGINE_CLASS_MAX] = 
>>> { };
>>> +    u64 gpu_timestamp, engine_mask = 0;
>>> +    bool gpu_stamp = false;
>>> +
>>> +    xe_pm_runtime_get(xe);
>>> +
>>> +    /* Accumulate all the exec queues from this client */
>>> +    mutex_lock(&xef->exec_queue.lock);
>>> +    xa_for_each(&xef->exec_queue.xa, i, q)
>>> +        xe_exec_queue_update_runtime(q);
>>> +    mutex_unlock(&xef->exec_queue.lock);
>>> +
>>> +
>>> +    /* Calculate capacity of each engine class */
>>> +    BUILD_BUG_ON(ARRAY_SIZE(class_to_mask) != XE_ENGINE_CLASS_MAX);
>>> +    for_each_gt(gt, xe, id_gt)
>>> +        engine_mask |= gt->info.engine_mask;
>>> +    for (i = 0; i < XE_ENGINE_CLASS_MAX; i++)
>>> +        capacity[i] = hweight64(engine_mask & class_to_mask[i]);
>>
>> FWIW the above two loops are static so could store capacity in struct 
>> xe_device.
> 
> yes, but just creating a cache in xe of something derived from gt is not
> something to consider lightly. Particularly considering the small number
> of xe->info.gt_count we have. For something that runs only when someone
> cat the fdinfo, this doesn't seem terrible.
> 
>>
>>> +
>>> +    /*
>>> +     * Iterate over all engines, printing the accumulated
>>> +     * runtime for this client, per engine class
>>> +     */
>>> +    for_each_gt(gt, xe, id_gt) {
>>> +        xe_force_wake_get(gt_to_fw(gt), XE_FW_GT);
>>> +        for_each_hw_engine(hwe, gt, id_hwe) {
>>> +            const char *class_name;
>>> +
>>> +            if (!capacity[hwe->class])
>>> +                continue;
>>> +
>>> +            /*
>>> +             * Use any (first) engine to have a timestamp to be used 
>>> every
>>> +             * time
>>> +             */
>>> +            if (!gpu_stamp) {
>>> +                gpu_timestamp = xe_hw_engine_read_timestamp(hwe);
>>> +                gpu_stamp = true;
>>> +            }
>>> +
>>> +            class_name = xe_hw_engine_class_to_str(hwe->class);
>>> +
>>> +            drm_printf(p, "drm-cycles-%s:\t%llu\n",
>>> +                   class_name, xef->runtime[hwe->class]);
>>> +            drm_printf(p, "drm-total-cycles-%s:\t%llu\n",
>>> +                   class_name, gpu_timestamp);
>>> +
>>> +            if (capacity[hwe->class] > 1)
>>> +                drm_printf(p, "drm-engine-capacity-%s:\t%lu\n",
>>> +                       class_name, capacity[hwe->class]);
>>> +
>>> +            /* engine class already handled, skip next iterations */
>>> +            capacity[hwe->class] = 0;
>>> +        }
>>> +        xe_force_wake_put(gt_to_fw(gt), XE_FW_GT);
>>> +    }
>>
>> More FWIW and AFAICT, could just walk the "list" of classes instead of 
> 
> xe_force_wake_get() is per gt, so the alternative would be... loop
> through the gts to get all forcewakes, loop through all engine classes, 
> loop
> again through all gts to put the forcewake. And we also need to consider
> that an engine class may not be available in all GTs... example:
> vcs/vecs in MTL and later, so we need to track it globally across GTs
> anyway.

Forcewake is only needed once for the gpu_timestamp, no? At least I 
don't see any other potential hardware access in the loop. Hence I 
thought if you could have a known engine to get the timestamp outside 
the loop, you could then run a flat loop (over classes) avoiding the per 
gt fw dance. Your choice ofc.
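
For illustration, a rough sketch of that flatter shape (assumptions:
xe_any_hw_engine() is a hypothetical helper returning some engine known to
be present, and capacity[] / xef->runtime[] are the values already computed
in the patch):

/*
 * Sketch only: forcewake is taken once, just for the GPU timestamp
 * read, and the printing becomes a flat loop over engine classes.
 */
static void show_runtime_flat(struct drm_printer *p, struct xe_file *xef,
			      unsigned long *capacity)
{
	struct xe_hw_engine *hwe = xe_any_hw_engine(xef->xe);
	u64 gpu_timestamp;
	int class;

	xe_force_wake_get(gt_to_fw(hwe->gt), XE_FW_GT);
	gpu_timestamp = xe_hw_engine_read_timestamp(hwe);
	xe_force_wake_put(gt_to_fw(hwe->gt), XE_FW_GT);

	for (class = 0; class < XE_ENGINE_CLASS_MAX; class++) {
		const char *name = xe_hw_engine_class_to_str(class);

		if (!capacity[class])
			continue;

		drm_printf(p, "drm-cycles-%s:\t%llu\n",
			   name, xef->runtime[class]);
		drm_printf(p, "drm-total-cycles-%s:\t%llu\n",
			   name, gpu_timestamp);
		if (capacity[class] > 1)
			drm_printf(p, "drm-engine-capacity-%s:\t%lu\n",
				   name, capacity[class]);
	}
}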

Regards,

Tvrtko

> 
>> the nested loop with skipping already visited classes. Assuming in xe 
>> there is an easy way to get a known present engine for the 
>> gpu_timestamp it would be flatter and less code.
> 
> from the device we can get either tile or gt. To work with the
> hw_engines we have to look inside the gt.
> 
> Lucas De Marchi
> 
>>
>> Regards,
>>
>> Tvrtko
>>
>>> +
>>> +    xe_pm_runtime_get(xe);
>>> +}
>>> +
>>>  /**
>>>   * xe_drm_client_fdinfo() - Callback for fdinfo interface
>>>   * @p: The drm_printer ptr
>>> @@ -192,5 +327,6 @@ static void show_meminfo(struct drm_printer *p, 
>>> struct drm_file *file)
>>>  void xe_drm_client_fdinfo(struct drm_printer *p, struct drm_file *file)
>>>  {
>>>      show_meminfo(p, file);
>>> +    show_runtime(p, file);
>>>  }
>>>  #endif
Lucas De Marchi May 8, 2024, 8:53 p.m. UTC | #4
On Wed, May 08, 2024 at 09:23:17AM GMT, Tvrtko Ursulin wrote:
>
>On 07/05/2024 22:35, Lucas De Marchi wrote:
>>On Fri, Apr 26, 2024 at 11:47:37AM GMT, Tvrtko Ursulin wrote:
>>>
>>>On 24/04/2024 00:56, Lucas De Marchi wrote:
>>>>Print the accumulated runtime for client when printing fdinfo.
>>>>Each time a query is done it first does 2 things:
>>>>
>>>>1) loop through all the exec queues for the current client and
>>>>   accumulate the runtime, per engine class. CTX_TIMESTAMP is used for
>>>>   that, being read from the context image.
>>>>
>>>>2) Read a "GPU timestamp" that can be used for considering "how much GPU
>>>>   time has passed" and that has the same unit/refclock as the one
>>>>   recording the runtime. RING_TIMESTAMP is used for that via MMIO.
>>>>
>>>>Since for all current platforms RING_TIMESTAMP follows the same
>>>>refclock, just read it once, using any first engine.
>>>>
>>>>This is exported to userspace as 2 numbers in fdinfo:
>>>>
>>>>    drm-cycles-<class>: <RUNTIME>
>>>>    drm-total-cycles-<class>: <TIMESTAMP>
>>>>
>>>>Userspace is expected to collect at least 2 samples, which allows to
>>>>know the client engine busyness as per:
>>>>
>>>>            RUNTIME1 - RUNTIME0
>>>>    busyness = ---------------------
>>>>              T1 - T0
>>>>
>>>>Another thing to point out is that it's expected that userspace will
>>>>read any 2 samples every few seconds.  Given the update frequency of the
>>>>counters involved and that CTX_TIMESTAMP is 32-bits, the counter for
>>>>each exec_queue can wrap around (assuming 100% utilization) after ~200s.
>>>>The wraparound is not perceived by userspace since it's just accumulated
>>>>for all the exec_queues in a 64-bit counter), but the measurement will
>>>>not be accurate if the samples are too far apart.
>>>>
>>>>This could be mitigated by adding a workqueue to accumulate the counters
>>>>every so often, but it's additional complexity for something that is
>>>>done already by userspace every few seconds in tools like gputop (from
>>>>igt), htop, nvtop, etc with none of them really defaulting to 1 sample
>>>>per minute or more.
>>>>
>>>>Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
>>>>---
>>>> Documentation/gpu/drm-usage-stats.rst       |  16 ++-
>>>> Documentation/gpu/xe/index.rst              |   1 +
>>>> Documentation/gpu/xe/xe-drm-usage-stats.rst |  10 ++
>>>> drivers/gpu/drm/xe/xe_drm_client.c          | 138 +++++++++++++++++++-
>>>> 4 files changed, 162 insertions(+), 3 deletions(-)
>>>> create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst
>>>>
>>>>diff --git a/Documentation/gpu/drm-usage-stats.rst 
>>>>b/Documentation/gpu/drm-usage-stats.rst
>>>>index 6dc299343b48..421766289b78 100644
>>>>--- a/Documentation/gpu/drm-usage-stats.rst
>>>>+++ b/Documentation/gpu/drm-usage-stats.rst
>>>>@@ -112,6 +112,17 @@ larger value within a reasonable period. 
>>>>Upon observing a value lower than what
>>>> was previously read, userspace is expected to stay with that 
>>>>larger previous
>>>> value until a monotonic update is seen.
>>>>+- drm-total-cycles-<keystr>: <uint>
>>>>+
>>>>+Engine identifier string must be the same as the one specified in the
>>>>+drm-cycles-<keystr> tag and shall contain the total number 
>>>>cycles for the given
>>>>+engine.
>>>>+
>>>>+This is a timestamp in GPU unspecified unit that matches the 
>>>>update rate
>>>>+of drm-cycles-<keystr>. For drivers that implement this 
>>>>interface, the engine
>>>>+utilization can be calculated entirely on the GPU clock domain, without
>>>>+considering the CPU sleep time between 2 samples.
>>>
>>>Two opens.
>>>
>>>1)
>>>Do we need to explicity document that drm-total-cycles and 
>>>drm-maxfreq are mutually exclusive?
>>
>>so userspace has a fallback mechanism to calculate utilization depending
>>on what keys are available?
>
>No, to document all three at once do not make sense. Or at least are 
>not expected. Or you envisage someone might legitimately emit all 
>three? I don't see what would be the semantics. When we have 
>cycles+maxfreq the latter is in Hz. And when we have cycles+total then 
>it is unitless. All three?

I don't follow what you mean here. *cycles* is actually a unit.

The engine spent 10 cycles running this context (drm-cycles). In the
same period there were 100 cycles available (drm-total-cycles). Current
frequency is X MHz. Max frequency is Y MHz. For me all of them make
sense if one wants to mix them together. For xe it doesn't make sense
because the counter backing drm-cycles and drm-total-cycles is unrelated
to the engine frequency.

I can add something in the doc that we do not expect to see all of them
together until we see a use case. Each driver may implement a subset.

>
>>>2)
>>>Should drm-total-cycles for now be documents as driver specific?
>>
>>you mean to call it xe-total-cycles?
>
>Yes but it is not an ask, just an open.

Ok, my opinion is that we shouldn't. Just like we have drm-cycles today
implemented by some drivers, but not all. I'd consider drm-curfreq, not
documented in the drm layer, as something to be fixed or migrated to a
driver-only interface (probably not possible anymore as it'd break the
uapi).  The problem I see with turning it into xe-total-cycles is that the
moment another driver decides to implement it, they will either have to
use the xe- prefix or xe will need to start publishing both keys.
As said above, I can document that it's not expected to use both total
and maxfreq, as is currently the case.

>
>>>I have added some more poeple in the cc who were involved with 
>>>driver fdinfo implementations if they will have an opinion.
>>>
>>>I would say potentially yes, and promote it to common if more than 
>>>one driver would use it.
>>>
>>>For instance I see panfrost has the driver specific drm-curfreq 
>>>(although isn't documenting it fully in panfrost.rst). And I have 
>>>to say it is somewhat questionable to expose the current frequency 
>>>per fdinfo per engine but not my call.
>>
>>aren't all of Documentation/gpu/drm-usage-stats.rst optional that
>>driver may or may not implement? When you say driver-specific I'd think
>>more of the ones not using <drm> as prefix as e.g. amd-*.
>>
>>I think drm-cycles + drm-total-cycles is just an alternative
>>implementation for engine utilization. Like drm-cycles + drm-maxfreq
>>already is an alternative to drm-engine and is not implemented by e.g.
>>amdgpu/i915.
>>
>>I will submit a new version of the entire patch series to get the ball
>>rolling, but let's keep this open for now.
>>
>><...>
>>
>>>>+static void show_runtime(struct drm_printer *p, struct drm_file *file)
>>>>+{
>>>>+    struct xe_file *xef = file->driver_priv;
>>>>+    struct xe_device *xe = xef->xe;
>>>>+    struct xe_gt *gt;
>>>>+    struct xe_hw_engine *hwe;
>>>>+    struct xe_exec_queue *q;
>>>>+    unsigned long i, id_hwe, id_gt, 
>>>>capacity[XE_ENGINE_CLASS_MAX] = { };
>>>>+    u64 gpu_timestamp, engine_mask = 0;
>>>>+    bool gpu_stamp = false;
>>>>+
>>>>+    xe_pm_runtime_get(xe);
>>>>+
>>>>+    /* Accumulate all the exec queues from this client */
>>>>+    mutex_lock(&xef->exec_queue.lock);
>>>>+    xa_for_each(&xef->exec_queue.xa, i, q)
>>>>+        xe_exec_queue_update_runtime(q);
>>>>+    mutex_unlock(&xef->exec_queue.lock);
>>>>+
>>>>+
>>>>+    /* Calculate capacity of each engine class */
>>>>+    BUILD_BUG_ON(ARRAY_SIZE(class_to_mask) != XE_ENGINE_CLASS_MAX);
>>>>+    for_each_gt(gt, xe, id_gt)
>>>>+        engine_mask |= gt->info.engine_mask;
>>>>+    for (i = 0; i < XE_ENGINE_CLASS_MAX; i++)
>>>>+        capacity[i] = hweight64(engine_mask & class_to_mask[i]);
>>>
>>>FWIW the above two loops are static so could store capacity in 
>>>struct xe_device.
>>
>>yes, but just creating a cache in xe of something derived from gt is not
>>something to consider lightly. Particularly considering the small number
>>of xe->info.gt_count we have. For something that runs only when someone
>>cat the fdinfo, this doesn't seem terrible.
>>
>>>
>>>>+
>>>>+    /*
>>>>+     * Iterate over all engines, printing the accumulated
>>>>+     * runtime for this client, per engine class
>>>>+     */
>>>>+    for_each_gt(gt, xe, id_gt) {
>>>>+        xe_force_wake_get(gt_to_fw(gt), XE_FW_GT);
>>>>+        for_each_hw_engine(hwe, gt, id_hwe) {
>>>>+            const char *class_name;
>>>>+
>>>>+            if (!capacity[hwe->class])
>>>>+                continue;
>>>>+
>>>>+            /*
>>>>+             * Use any (first) engine to have a timestamp to be 
>>>>used every
>>>>+             * time
>>>>+             */
>>>>+            if (!gpu_stamp) {
>>>>+                gpu_timestamp = xe_hw_engine_read_timestamp(hwe);
>>>>+                gpu_stamp = true;
>>>>+            }
>>>>+
>>>>+            class_name = xe_hw_engine_class_to_str(hwe->class);
>>>>+
>>>>+            drm_printf(p, "drm-cycles-%s:\t%llu\n",
>>>>+                   class_name, xef->runtime[hwe->class]);
>>>>+            drm_printf(p, "drm-total-cycles-%s:\t%llu\n",
>>>>+                   class_name, gpu_timestamp);
>>>>+
>>>>+            if (capacity[hwe->class] > 1)
>>>>+                drm_printf(p, "drm-engine-capacity-%s:\t%lu\n",
>>>>+                       class_name, capacity[hwe->class]);
>>>>+
>>>>+            /* engine class already handled, skip next iterations */
>>>>+            capacity[hwe->class] = 0;
>>>>+        }
>>>>+        xe_force_wake_put(gt_to_fw(gt), XE_FW_GT);
>>>>+    }
>>>
>>>More FWIW and AFAICT, could just walk the "list" of classes 
>>>instead of
>>
>>xe_force_wake_get() is per gt, so the alternative would be... loop
>>through the gts to get all forcewakes, loop through all engine 
>>classes, loop
>>again through all gts to put the forcewake. And we also need to consider
>>that an engine class may not be available in all GTs... example:
>>vcs/vecs in MTL and later, so we need to track it globally across GTs
>>anyway.
>
>Forcewake is only needed once for the gpu_timestamp, no? At least I 
>don't see any other potential hardware access in the loop. Hence I 
>thought if you could have a known engine to get the timestamp outside 
>the loop, you could then run a flat loop (over classes) avoiding the 
>per gt fw dance. Your choice ofc.

makes sense... I will try this and run some tests.

thanks
Lucas De Marchi
Tvrtko Ursulin May 9, 2024, 9:39 a.m. UTC | #5
On 08/05/2024 21:53, Lucas De Marchi wrote:
> On Wed, May 08, 2024 at 09:23:17AM GMT, Tvrtko Ursulin wrote:
>>
>> On 07/05/2024 22:35, Lucas De Marchi wrote:
>>> On Fri, Apr 26, 2024 at 11:47:37AM GMT, Tvrtko Ursulin wrote:
>>>>
>>>> On 24/04/2024 00:56, Lucas De Marchi wrote:
>>>>> Print the accumulated runtime for client when printing fdinfo.
>>>>> Each time a query is done it first does 2 things:
>>>>>
>>>>> 1) loop through all the exec queues for the current client and
>>>>>    accumulate the runtime, per engine class. CTX_TIMESTAMP is used for
>>>>>    that, being read from the context image.
>>>>>
>>>>> 2) Read a "GPU timestamp" that can be used for considering "how 
>>>>> much GPU
>>>>>    time has passed" and that has the same unit/refclock as the one
>>>>>    recording the runtime. RING_TIMESTAMP is used for that via MMIO.
>>>>>
>>>>> Since for all current platforms RING_TIMESTAMP follows the same
>>>>> refclock, just read it once, using any first engine.
>>>>>
>>>>> This is exported to userspace as 2 numbers in fdinfo:
>>>>>
>>>>>     drm-cycles-<class>: <RUNTIME>
>>>>>     drm-total-cycles-<class>: <TIMESTAMP>
>>>>>
>>>>> Userspace is expected to collect at least 2 samples, which allows to
>>>>> know the client engine busyness as per:
>>>>>
>>>>>             RUNTIME1 - RUNTIME0
>>>>>     busyness = ---------------------
>>>>>               T1 - T0
>>>>>
>>>>> Another thing to point out is that it's expected that userspace will
>>>>> read any 2 samples every few seconds.  Given the update frequency 
>>>>> of the
>>>>> counters involved and that CTX_TIMESTAMP is 32-bits, the counter for
>>>>> each exec_queue can wrap around (assuming 100% utilization) after 
>>>>> ~200s.
>>>>> The wraparound is not perceived by userspace since it's just 
>>>>> accumulated
>>>>> for all the exec_queues in a 64-bit counter), but the measurement will
>>>>> not be accurate if the samples are too far apart.
>>>>>
>>>>> This could be mitigated by adding a workqueue to accumulate the 
>>>>> counters
>>>>> every so often, but it's additional complexity for something that is
>>>>> done already by userspace every few seconds in tools like gputop (from
>>>>> igt), htop, nvtop, etc with none of them really defaulting to 1 sample
>>>>> per minute or more.
>>>>>
>>>>> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
>>>>> ---
>>>>>  Documentation/gpu/drm-usage-stats.rst       |  16 ++-
>>>>>  Documentation/gpu/xe/index.rst              |   1 +
>>>>>  Documentation/gpu/xe/xe-drm-usage-stats.rst |  10 ++
>>>>>  drivers/gpu/drm/xe/xe_drm_client.c          | 138 
>>>>> +++++++++++++++++++-
>>>>>  4 files changed, 162 insertions(+), 3 deletions(-)
>>>>>  create mode 100644 Documentation/gpu/xe/xe-drm-usage-stats.rst
>>>>>
>>>>> diff --git a/Documentation/gpu/drm-usage-stats.rst 
>>>>> b/Documentation/gpu/drm-usage-stats.rst
>>>>> index 6dc299343b48..421766289b78 100644
>>>>> --- a/Documentation/gpu/drm-usage-stats.rst
>>>>> +++ b/Documentation/gpu/drm-usage-stats.rst
>>>>> @@ -112,6 +112,17 @@ larger value within a reasonable period. Upon 
>>>>> observing a value lower than what
>>>>>  was previously read, userspace is expected to stay with that 
>>>>> larger previous
>>>>>  value until a monotonic update is seen.
>>>>> +- drm-total-cycles-<keystr>: <uint>
>>>>> +
>>>>> +Engine identifier string must be the same as the one specified in the
>>>>> +drm-cycles-<keystr> tag and shall contain the total number cycles 
>>>>> for the given
>>>>> +engine.
>>>>> +
>>>>> +This is a timestamp in GPU unspecified unit that matches the 
>>>>> update rate
>>>>> +of drm-cycles-<keystr>. For drivers that implement this interface, 
>>>>> the engine
>>>>> +utilization can be calculated entirely on the GPU clock domain, 
>>>>> without
>>>>> +considering the CPU sleep time between 2 samples.
>>>>
>>>> Two opens.
>>>>
>>>> 1)
>>>> Do we need to explicity document that drm-total-cycles and 
>>>> drm-maxfreq are mutually exclusive?
>>>
>>> so userspace has a fallback mechanism to calculate utilization depending
>>> on what keys are available?
>>
>> No, to document all three at once do not make sense. Or at least are 
>> not expected. Or you envisage someone might legitimately emit all 
>> three? I don't see what would be the semantics. When we have 
>> cycles+maxfreq the latter is in Hz. And when we have cycles+total then 
>> it is unitless. All three?
> 
> I don't follow what you mean here. *cycles* is actually a unit.
> 
> The engine spent 10 cycles running this context (drm-cycles). In the
> same period there were 100 cycles available (drm-total-cycles). Current
> frequency is X MHz. Max frequency is Y MHz. For me all of them make
> sense if one wants to mix them together. For xe it doesn't make sense
> because the counter backing drm-cycles and drm-total-cycles is unrelated
> to the engine frequency.
> 
> I can add something in the doc that we do not expected to see all of them
> together until we see a usecase. Each driver may implement a subset.

I still don't quite see how a combination of cycles, total cycles and 
maxfreq makes sense together. It would require a driver where the cycle 
period is equal to 1 / maxfreq, which also means total-cycles would be 
equivalent to maxfreq, making one of them redundant. So for drivers like 
xe, where the cycle period is unrelated to maxfreq (or even to the fatally 
misguided curfreq), it doesn't make sense, and for a driver like the above 
it is not needed. What use case am I missing?
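
To spell the redundancy out: if one cycle corresponds to 1 / maxfreq 
seconds, then over a sampling window

	TOTAL1 - TOTAL0 = maxfreq * (T1 - T0)

so drm-total-cycles would carry no information beyond what drm-maxfreq 
plus the CPU wall-clock time between the samples already give.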

We need to document this properly so userspace knows how to do the right 
thing depending on what keys they discover.

>>>> 2)
>>>> Should drm-total-cycles for now be documents as driver specific?
>>>
>>> you mean to call it xe-total-cycles?
>>
>> Yes but it is not an ask, just an open.
> 
> Ok, my opinion is that we shouldn't. Just like we have drm-cycles today
> implemented by some drivers, but not all. I'd consider the drm-curfreq,
> not documented in the drm layer as something to be fixed or migrated to
> a driver-only interface (probably not possible anymore as it'd break the
> uapi).  Problem I see with turning it into xe-total-cycles, is that the
> moment another driver decide to implement they will either have to use
> xe- prefix or xe will need to start publishing both keys.
> As said above, I can document that it's not expected to use both total
> and maxfreq as it's currently the case.

I see your point. What if, as an alternative solution, we said it is okay 
to prefix driver-specific keys with drm-? It would legitimise panfrost's 
drm-curfreq. The downside would be that anyone adding new common keys would 
have to inspect all the driver-specific docs, and worse, the actual source, 
since people do forget to document things.

Even having drm-cycles + xe-total-cycles wouldn't be spec compliant. 
Okay, drm-total-cycles- sounds like the least bad solution. Luckily it 
feels like a clean concept with some chance of reuse.

Regards,

Tvrtko

>>>> I have added some more poeple in the cc who were involved with 
>>>> driver fdinfo implementations if they will have an opinion.
>>>>
>>>> I would say potentially yes, and promote it to common if more than 
>>>> one driver would use it.
>>>>
>>>> For instance I see panfrost has the driver specific drm-curfreq 
>>>> (although isn't documenting it fully in panfrost.rst). And I have to 
>>>> say it is somewhat questionable to expose the current frequency per 
>>>> fdinfo per engine but not my call.
>>>
>>> aren't all of Documentation/gpu/drm-usage-stats.rst optional that
>>> driver may or may not implement? When you say driver-specific I'd think
>>> more of the ones not using <drm> as prefix as e.g. amd-*.
>>>
>>> I think drm-cycles + drm-total-cycles is just an alternative
>>> implementation for engine utilization. Like drm-cycles + drm-maxfreq
>>> already is an alternative to drm-engine and is not implemented by e.g.
>>> amdgpu/i915.
>>>
>>> I will submit a new version of the entire patch series to get the ball
>>> rolling, but let's keep this open for now.
>>>
>>> <...>
>>>
>>>>> +static void show_runtime(struct drm_printer *p, struct drm_file 
>>>>> *file)
>>>>> +{
>>>>> +    struct xe_file *xef = file->driver_priv;
>>>>> +    struct xe_device *xe = xef->xe;
>>>>> +    struct xe_gt *gt;
>>>>> +    struct xe_hw_engine *hwe;
>>>>> +    struct xe_exec_queue *q;
>>>>> +    unsigned long i, id_hwe, id_gt, capacity[XE_ENGINE_CLASS_MAX] 
>>>>> = { };
>>>>> +    u64 gpu_timestamp, engine_mask = 0;
>>>>> +    bool gpu_stamp = false;
>>>>> +
>>>>> +    xe_pm_runtime_get(xe);
>>>>> +
>>>>> +    /* Accumulate all the exec queues from this client */
>>>>> +    mutex_lock(&xef->exec_queue.lock);
>>>>> +    xa_for_each(&xef->exec_queue.xa, i, q)
>>>>> +        xe_exec_queue_update_runtime(q);
>>>>> +    mutex_unlock(&xef->exec_queue.lock);
>>>>> +
>>>>> +
>>>>> +    /* Calculate capacity of each engine class */
>>>>> +    BUILD_BUG_ON(ARRAY_SIZE(class_to_mask) != XE_ENGINE_CLASS_MAX);
>>>>> +    for_each_gt(gt, xe, id_gt)
>>>>> +        engine_mask |= gt->info.engine_mask;
>>>>> +    for (i = 0; i < XE_ENGINE_CLASS_MAX; i++)
>>>>> +        capacity[i] = hweight64(engine_mask & class_to_mask[i]);
>>>>
>>>> FWIW the above two loops are static so could store capacity in 
>>>> struct xe_device.
>>>
>>> yes, but just creating a cache in xe of something derived from gt is not
>>> something to consider lightly. Particularly considering the small number
>>> of xe->info.gt_count we have. For something that runs only when someone
>>> cat the fdinfo, this doesn't seem terrible.
>>>
>>>>
>>>>> +
>>>>> +    /*
>>>>> +     * Iterate over all engines, printing the accumulated
>>>>> +     * runtime for this client, per engine class
>>>>> +     */
>>>>> +    for_each_gt(gt, xe, id_gt) {
>>>>> +        xe_force_wake_get(gt_to_fw(gt), XE_FW_GT);
>>>>> +        for_each_hw_engine(hwe, gt, id_hwe) {
>>>>> +            const char *class_name;
>>>>> +
>>>>> +            if (!capacity[hwe->class])
>>>>> +                continue;
>>>>> +
>>>>> +            /*
>>>>> +             * Use any (first) engine to have a timestamp to be 
>>>>> used every
>>>>> +             * time
>>>>> +             */
>>>>> +            if (!gpu_stamp) {
>>>>> +                gpu_timestamp = xe_hw_engine_read_timestamp(hwe);
>>>>> +                gpu_stamp = true;
>>>>> +            }
>>>>> +
>>>>> +            class_name = xe_hw_engine_class_to_str(hwe->class);
>>>>> +
>>>>> +            drm_printf(p, "drm-cycles-%s:\t%llu\n",
>>>>> +                   class_name, xef->runtime[hwe->class]);
>>>>> +            drm_printf(p, "drm-total-cycles-%s:\t%llu\n",
>>>>> +                   class_name, gpu_timestamp);
>>>>> +
>>>>> +            if (capacity[hwe->class] > 1)
>>>>> +                drm_printf(p, "drm-engine-capacity-%s:\t%lu\n",
>>>>> +                       class_name, capacity[hwe->class]);
>>>>> +
>>>>> +            /* engine class already handled, skip next iterations */
>>>>> +            capacity[hwe->class] = 0;
>>>>> +        }
>>>>> +        xe_force_wake_put(gt_to_fw(gt), XE_FW_GT);
>>>>> +    }
>>>>
>>>> More FWIW and AFAICT, could just walk the "list" of classes instead of
>>>
>>> xe_force_wake_get() is per gt, so the alternative would be... loop
>>> through the gts to get all forcewakes, loop through all engine 
>>> classes, loop
>>> again through all gts to put the forcewake. And we also need to consider
>>> that an engine class may not be available in all GTs... example:
>>> vcs/vecs in MTL and later, so we need to track it globally across GTs
>>> anyway.
>>
>> Forcewake is only needed once for the gpu_timestamp, no? At least I 
>> don't see any other potential hardware access in the loop. Hence I 
>> thought if you could have a known engine to get the timestamp outside 
>> the loop, you could then run a flat loop (over classes) avoiding the 
>> per gt fw dance. Your choice ofc.
> 
> makes sense... I will try this and run some tests.
> 
> thanks
> Lucas De Marchi

Patch

diff --git a/Documentation/gpu/drm-usage-stats.rst b/Documentation/gpu/drm-usage-stats.rst
index 6dc299343b48..421766289b78 100644
--- a/Documentation/gpu/drm-usage-stats.rst
+++ b/Documentation/gpu/drm-usage-stats.rst
@@ -112,6 +112,17 @@  larger value within a reasonable period. Upon observing a value lower than what
 was previously read, userspace is expected to stay with that larger previous
 value until a monotonic update is seen.
 
+- drm-total-cycles-<keystr>: <uint>
+
+Engine identifier string must be the same as the one specified in the
+drm-cycles-<keystr> tag and shall contain the total number of cycles for the
+given engine.
+
+This is a timestamp in an unspecified GPU unit that matches the update rate
+of drm-cycles-<keystr>. For drivers that implement this interface, the engine
+utilization can be calculated entirely on the GPU clock domain, without
+considering the CPU sleep time between 2 samples.
+
 - drm-maxfreq-<keystr>: <uint> [Hz|MHz|KHz]
 
 Engine identifier string must be the same as the one specified in the
@@ -168,5 +179,6 @@  be documented above and where possible, aligned with other drivers.
 Driver specific implementations
 -------------------------------
 
-:ref:`i915-usage-stats`
-:ref:`panfrost-usage-stats`
+* :ref:`i915-usage-stats`
+* :ref:`panfrost-usage-stats`
+* :ref:`xe-usage-stats`
diff --git a/Documentation/gpu/xe/index.rst b/Documentation/gpu/xe/index.rst
index c224ecaee81e..3f07aa3b5432 100644
--- a/Documentation/gpu/xe/index.rst
+++ b/Documentation/gpu/xe/index.rst
@@ -23,3 +23,4 @@  DG2, etc is provided to prototype the driver.
    xe_firmware
    xe_tile
    xe_debugging
+   xe-drm-usage-stats.rst
diff --git a/Documentation/gpu/xe/xe-drm-usage-stats.rst b/Documentation/gpu/xe/xe-drm-usage-stats.rst
new file mode 100644
index 000000000000..ccb48733cbe3
--- /dev/null
+++ b/Documentation/gpu/xe/xe-drm-usage-stats.rst
@@ -0,0 +1,10 @@ 
+.. SPDX-License-Identifier: GPL-2.0+
+
+.. _xe-usage-stats:
+
+========================================
+Xe DRM client usage stats implementation
+========================================
+
+.. kernel-doc:: drivers/gpu/drm/xe/xe_drm_client.c
+   :doc: DRM Client usage stats
diff --git a/drivers/gpu/drm/xe/xe_drm_client.c b/drivers/gpu/drm/xe/xe_drm_client.c
index 08f0b7c95901..0227383910fa 100644
--- a/drivers/gpu/drm/xe/xe_drm_client.c
+++ b/drivers/gpu/drm/xe/xe_drm_client.c
@@ -2,6 +2,7 @@ 
 /*
  * Copyright © 2023 Intel Corporation
  */
+#include "xe_drm_client.h"
 
 #include <drm/drm_print.h>
 #include <drm/xe_drm.h>
@@ -12,9 +13,66 @@ 
 #include "xe_bo.h"
 #include "xe_bo_types.h"
 #include "xe_device_types.h"
-#include "xe_drm_client.h"
+#include "xe_exec_queue.h"
+#include "xe_gt.h"
+#include "xe_hw_engine.h"
+#include "xe_pm.h"
 #include "xe_trace.h"
 
+/**
+ * DOC: DRM Client usage stats
+ *
+ * The drm/xe driver implements the DRM client usage stats specification as
+ * documented in :ref:`drm-client-usage-stats`.
+ *
+ * Example of the output showing the implemented key value pairs and entirety of
+ * the currently possible format options:
+ *
+ * ::
+ *
+ * 	pos:    0
+ * 	flags:  0100002
+ * 	mnt_id: 26
+ * 	ino:    685
+ * 	drm-driver:     xe
+ * 	drm-client-id:  3
+ * 	drm-pdev:       0000:03:00.0
+ * 	drm-total-system:       0
+ * 	drm-shared-system:      0
+ * 	drm-active-system:      0
+ * 	drm-resident-system:    0
+ * 	drm-purgeable-system:   0
+ * 	drm-total-gtt:  192 KiB
+ * 	drm-shared-gtt: 0
+ * 	drm-active-gtt: 0
+ * 	drm-resident-gtt:       192 KiB
+ * 	drm-total-vram0:        23992 KiB
+ * 	drm-shared-vram0:       16 MiB
+ * 	drm-active-vram0:       0
+ * 	drm-resident-vram0:     23992 KiB
+ * 	drm-total-stolen:       0
+ * 	drm-shared-stolen:      0
+ * 	drm-active-stolen:      0
+ * 	drm-resident-stolen:    0
+ * 	drm-cycles-rcs: 28257900
+ * 	drm-total-cycles-rcs:   7655183225
+ * 	drm-cycles-bcs: 0
+ * 	drm-total-cycles-bcs:   7655183225
+ * 	drm-cycles-vcs: 0
+ * 	drm-total-cycles-vcs:   7655183225
+ * 	drm-engine-capacity-vcs:        2
+ * 	drm-cycles-vecs:        0
+ * 	drm-total-cycles-vecs:  7655183225
+ * 	drm-engine-capacity-vecs:       2
+ * 	drm-cycles-ccs: 0
+ * 	drm-total-cycles-ccs:   7655183225
+ * 	drm-engine-capacity-ccs:        4
+ *
+ * Possible `drm-cycles-` key names are: `rcs`, `ccs`, `bcs`, `vcs`, `vecs` and
+ * "other".
+ */
+
+
 /**
  * xe_drm_client_alloc() - Allocate drm client
  * @void: No arg
@@ -179,6 +237,83 @@  static void show_meminfo(struct drm_printer *p, struct drm_file *file)
 	}
 }
 
+static const u64 class_to_mask[] = {
+        [XE_ENGINE_CLASS_RENDER] = XE_HW_ENGINE_RCS_MASK,
+        [XE_ENGINE_CLASS_VIDEO_DECODE] = XE_HW_ENGINE_VCS_MASK,
+        [XE_ENGINE_CLASS_VIDEO_ENHANCE] = XE_HW_ENGINE_VECS_MASK,
+        [XE_ENGINE_CLASS_COPY] = XE_HW_ENGINE_BCS_MASK,
+        [XE_ENGINE_CLASS_OTHER] = XE_HW_ENGINE_GSCCS_MASK,
+        [XE_ENGINE_CLASS_COMPUTE] = XE_HW_ENGINE_CCS_MASK,
+};
+
+static void show_runtime(struct drm_printer *p, struct drm_file *file)
+{
+	struct xe_file *xef = file->driver_priv;
+	struct xe_device *xe = xef->xe;
+	struct xe_gt *gt;
+	struct xe_hw_engine *hwe;
+	struct xe_exec_queue *q;
+	unsigned long i, id_hwe, id_gt, capacity[XE_ENGINE_CLASS_MAX] = { };
+	u64 gpu_timestamp, engine_mask = 0;
+	bool gpu_stamp = false;
+
+	xe_pm_runtime_get(xe);
+
+	/* Accumulate all the exec queues from this client */
+	mutex_lock(&xef->exec_queue.lock);
+	xa_for_each(&xef->exec_queue.xa, i, q)
+		xe_exec_queue_update_runtime(q);
+	mutex_unlock(&xef->exec_queue.lock);
+
+
+	/* Calculate capacity of each engine class */
+	BUILD_BUG_ON(ARRAY_SIZE(class_to_mask) != XE_ENGINE_CLASS_MAX);
+	for_each_gt(gt, xe, id_gt)
+		engine_mask |= gt->info.engine_mask;
+	for (i = 0; i < XE_ENGINE_CLASS_MAX; i++)
+		capacity[i] = hweight64(engine_mask & class_to_mask[i]);
+
+	/*
+	 * Iterate over all engines, printing the accumulated
+	 * runtime for this client, per engine class
+	 */
+	for_each_gt(gt, xe, id_gt) {
+		xe_force_wake_get(gt_to_fw(gt), XE_FW_GT);
+		for_each_hw_engine(hwe, gt, id_hwe) {
+			const char *class_name;
+
+			if (!capacity[hwe->class])
+				continue;
+
+			/*
+			 * Use any (first) engine to have a timestamp to be used every
+			 * time
+			 */
+			if (!gpu_stamp) {
+				gpu_timestamp = xe_hw_engine_read_timestamp(hwe);
+				gpu_stamp = true;
+			}
+
+			class_name = xe_hw_engine_class_to_str(hwe->class);
+
+			drm_printf(p, "drm-cycles-%s:\t%llu\n",
+				   class_name, xef->runtime[hwe->class]);
+			drm_printf(p, "drm-total-cycles-%s:\t%llu\n",
+				   class_name, gpu_timestamp);
+
+			if (capacity[hwe->class] > 1)
+				drm_printf(p, "drm-engine-capacity-%s:\t%lu\n",
+					   class_name, capacity[hwe->class]);
+
+			/* engine class already handled, skip next iterations */
+			capacity[hwe->class] = 0;
+		}
+		xe_force_wake_put(gt_to_fw(gt), XE_FW_GT);
+	}
+
+	xe_pm_runtime_put(xe);
+}
+
 /**
  * xe_drm_client_fdinfo() - Callback for fdinfo interface
  * @p: The drm_printer ptr
@@ -192,5 +327,6 @@  static void show_meminfo(struct drm_printer *p, struct drm_file *file)
 void xe_drm_client_fdinfo(struct drm_printer *p, struct drm_file *file)
 {
 	show_meminfo(p, file);
+	show_runtime(p, file);
 }
 #endif