
[v3,12/13] drm/sched/doc: Add Entity teardown documentation

Message ID 20230912021615.2086698-13-matthew.brost@intel.com (mailing list archive)
State New, archived
Series: DRM scheduler changes for Xe

Commit Message

Matthew Brost Sept. 12, 2023, 2:16 a.m. UTC
Provide documentation to guide in ways to tear down an entity.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 Documentation/gpu/drm-mm.rst             |  6 ++++++
 drivers/gpu/drm/scheduler/sched_entity.c | 19 +++++++++++++++++++
 2 files changed, 25 insertions(+)

Comments

Christian König Sept. 13, 2023, 3:04 p.m. UTC | #1
Am 12.09.23 um 04:16 schrieb Matthew Brost:
> Provide documentation to guide in ways to tear down an entity.
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   Documentation/gpu/drm-mm.rst             |  6 ++++++
>   drivers/gpu/drm/scheduler/sched_entity.c | 19 +++++++++++++++++++
>   2 files changed, 25 insertions(+)
>
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index c19b34b1c0ed..cb4d6097897e 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -552,6 +552,12 @@ Overview
>   .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>      :doc: Overview
>   
> +Entity teardown
> +---------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_entity.c
> +   :doc: Entity teardown
> +
>   Scheduler Function References
>   -----------------------------
>   
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 37557fbb96d0..76f3e10218bb 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -21,6 +21,25 @@
>    *
>    */
>   
> +/**
> + * DOC: Entity teardown
> + *
> + * Drivers can tear down an entity for several reasons. Typical reasons are:
> + * the user closes the entity via an IOCTL, the FD associated with the entity
> + * is closed, or the entity encounters an error. The GPU scheduler provides the
> + * basic infrastructure to do this in a few different ways.
> + *
> + * 1. Let the entity run dry (both the pending list and job queue) and then call
> + * drm_sched_entity_fini. The backend can accelerate the process of running dry.
> + * For example set a flag so run_job is a NOP and set the TDR to a low value to
> + * signal all jobs in a timely manner (this example works for
> + * DRM_SCHED_POLICY_SINGLE_ENTITY).

Please note that it is a requirement from the X server that all 
externally visible effects of command submission must still be visible 
even after the fd is closed.

This has given us a ton of headaches and is one of the reasons we 
have the drm_sched_entity_flush() handling in the first place.

As long as you don't care about X server compatibility that shouldn't 
matter to you.

Regards,
Christian.
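
For reference, a minimal sketch of such a flush-then-fini sequence on fd close.
The my_* names, the per-file context and the fixed timeout are purely
illustrative and not something this series defines:

#include <linux/jiffies.h>
#include <linux/slab.h>
#include <drm/drm_device.h>
#include <drm/drm_file.h>
#include <drm/gpu_scheduler.h>

/* Illustrative per-file context; not part of this series. */
struct my_file_ctx {
        struct drm_sched_entity entity;
};

/* Would be hooked up as drm_driver.postclose in this sketch. */
static void my_driver_postclose(struct drm_device *dev, struct drm_file *file)
{
        struct my_file_ctx *ctx = file->driver_priv;

        /*
         * Give already queued jobs a bounded amount of time to reach the
         * hardware so their externally visible effects survive the fd
         * close, then tear the entity down.
         */
        drm_sched_entity_flush(&ctx->entity, msecs_to_jiffies(1000));
        drm_sched_entity_fini(&ctx->entity);
        kfree(ctx);
}

drm_sched_entity_destroy() wraps the same flush/fini pair behind a default
timeout.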

> + *
> + * 2. Kill the entity directly via drm_sched_entity_flush /
> + * drm_sched_entity_fini ensuring all pending and queued jobs are off the
> + * hardware and signaled.



> + */
> +
>   #include <linux/kthread.h>
>   #include <linux/slab.h>
>   #include <linux/completion.h>
Luben Tuikov Sept. 14, 2023, 2:06 a.m. UTC | #2
On 2023-09-11 22:16, Matthew Brost wrote:
> Provide documentation to guide in ways to tear down an entity.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>  Documentation/gpu/drm-mm.rst             |  6 ++++++
>  drivers/gpu/drm/scheduler/sched_entity.c | 19 +++++++++++++++++++
>  2 files changed, 25 insertions(+)
> 
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index c19b34b1c0ed..cb4d6097897e 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -552,6 +552,12 @@ Overview
>  .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>     :doc: Overview
>  
> +Entity teardown
> +---------------
> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_entity.c
> +   :doc: Entity teardown
> +
>  Scheduler Function References
>  -----------------------------
>  
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 37557fbb96d0..76f3e10218bb 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -21,6 +21,25 @@
>   *
>   */
>  
> +/**
> + * DOC: Entity teardown
> + *
> + * Drivers can tear down an entity for several reasons. Typical reasons are:
> + * the user closes the entity via an IOCTL, the FD associated with the entity
> + * is closed, or the entity encounters an error.

So in this third case, "entity encounters an error", we need to make sure
that no new jobs are being pushed to the entity, or at least say that here.
IOW, in all three cases, the common denominator (requirement?) is that no new
jobs are being pushed to the entity, i.e. that there are no incoming jobs.
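
A hedged sketch of that requirement from the driver side, with made-up my_*
names; once teardown of an entity has started, nothing may push further jobs
to it:

#include <linux/errno.h>
#include <linux/mutex.h>
#include <drm/gpu_scheduler.h>

/* Illustrative per-entity context; none of these names are from the patch. */
struct my_entity_ctx {
        struct mutex lock;
        bool tearing_down;
        struct drm_sched_entity entity;
};

/*
 * Submission path: refuse new jobs once teardown has started. Assumes the
 * job has already been drm_sched_job_init()'ed and armed against
 * ctx->entity.
 */
static int my_push_job(struct my_entity_ctx *ctx, struct drm_sched_job *job)
{
        int ret = 0;

        mutex_lock(&ctx->lock);
        if (ctx->tearing_down)
                ret = -ECANCELED;
        else
                drm_sched_entity_push_job(job);
        mutex_unlock(&ctx->lock);

        return ret;
}

The teardown path would take the same lock, set tearing_down, and only then
start draining or killing the entity.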

> The GPU scheduler provides the
> + * basic infrastructure to do this in a few different ways.

Well, I'd say "in two different ways." or "in the following two ways."
I'd rather have "two" in there to make sure that it is these two, and
not any more/less/etc.

> + *
> + * 1. Let the entity run dry (both the pending list and job queue) and then call
> + * drm_sched_entity_fini. The backend can accelerate the process of running dry.
> + * For example set a flag so run_job is a NOP and set the TDR to a low value to
> + * signal all jobs in a timely manner (this example works for
> + * DRM_SCHED_POLICY_SINGLE_ENTITY).
> + *
> + * 2. Kill the entity directly via drm_sched_entity_flush /
> + * drm_sched_entity_fini ensuring all pending and queued jobs are off the
> + * hardware and signaled.
> + */
> +
>  #include <linux/kthread.h>
>  #include <linux/slab.h>
>  #include <linux/completion.h>
Danilo Krummrich Sept. 16, 2023, 6:06 p.m. UTC | #3
On 9/12/23 04:16, Matthew Brost wrote:
> Provide documentation to guide in ways to tear down an entity.
> 
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   Documentation/gpu/drm-mm.rst             |  6 ++++++
>   drivers/gpu/drm/scheduler/sched_entity.c | 19 +++++++++++++++++++
>   2 files changed, 25 insertions(+)
> 
> diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
> index c19b34b1c0ed..cb4d6097897e 100644
> --- a/Documentation/gpu/drm-mm.rst
> +++ b/Documentation/gpu/drm-mm.rst
> @@ -552,6 +552,12 @@ Overview
>   .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
>      :doc: Overview
>   
> +Entity teardown
> +---------------

While I think it is good to document this as well, my concern was more about tearing
down the drm_gpu_scheduler. (See also my response to patch 11 of this series.)

How do we ensure that the pending_list is actually empty before calling
drm_sched_fini()? If we don't, we potentially leak memory.

For instance, we could let drm_sched_fini() (or a separate drm_sched_teardown())
cancel run work first and leave free work running until the pending_list is empty.

If we think drivers should take care of this themselves (e.g. through reference counting
jobs per scheduler), we should document this and explain why we can't have the scheduler
do it for us.
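
One possible shape of that driver-side variant (reference counting jobs per
scheduler), sketched with made-up names; whether this belongs in drivers or in
a future drm_sched_teardown() is exactly the open question:

#include <linux/atomic.h>
#include <linux/wait.h>
#include <drm/gpu_scheduler.h>

/* Illustrative wrapper; not an API proposed by this series. */
struct my_sched {
        struct drm_gpu_scheduler base;
        atomic_t in_flight;        /* incremented when a job is pushed */
        wait_queue_head_t idle;    /* initialized at scheduler creation */
};

/* Called from the driver's ->free_job() callback for every retired job. */
static void my_job_retired(struct my_sched *sched)
{
        if (atomic_dec_and_test(&sched->in_flight))
                wake_up_all(&sched->idle);
}

/* Only call drm_sched_fini() once the pending list has drained. */
static void my_sched_teardown(struct my_sched *sched)
{
        wait_event(sched->idle, atomic_read(&sched->in_flight) == 0);
        drm_sched_fini(&sched->base);
}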

> +
> +.. kernel-doc:: drivers/gpu/drm/scheduler/sched_entity.c
> +   :doc: Entity teardown
> +
>   Scheduler Function References
>   -----------------------------
>   
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 37557fbb96d0..76f3e10218bb 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -21,6 +21,25 @@
>    *
>    */
>   
> +/**
> + * DOC: Entity teardown
> + *
> + * Drivers can tear down an entity for several reasons. Typical reasons are:
> + * the user closes the entity via an IOCTL, the FD associated with the entity
> + * is closed, or the entity encounters an error. The GPU scheduler provides the
> + * basic infrastructure to do this in a few different ways.
> + *
> + * 1. Let the entity run dry (both the pending list and job queue) and then call
> + * drm_sched_entity_fini. The backend can accelerate the process of running dry.
> + * For example set a flag so run_job is a NOP and set the TDR to a low value to
> + * signal all jobs in a timely manner (this example works for
> + * DRM_SCHED_POLICY_SINGLE_ENTITY).
> + *
> + * 2. Kill the entity directly via drm_sched_entity_flush /
> + * drm_sched_entity_fini ensuring all pending and queued jobs are off the
> + * hardware and signaled.
> + */
> +
>   #include <linux/kthread.h>
>   #include <linux/slab.h>
>   #include <linux/completion.h>

Patch

diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
index c19b34b1c0ed..cb4d6097897e 100644
--- a/Documentation/gpu/drm-mm.rst
+++ b/Documentation/gpu/drm-mm.rst
@@ -552,6 +552,12 @@  Overview
 .. kernel-doc:: drivers/gpu/drm/scheduler/sched_main.c
    :doc: Overview
 
+Entity teardown
+---------------
+
+.. kernel-doc:: drivers/gpu/drm/scheduler/sched_entity.c
+   :doc: Entity teardown
+
 Scheduler Function References
 -----------------------------
 
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 37557fbb96d0..76f3e10218bb 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -21,6 +21,25 @@ 
  *
  */
 
+/**
+ * DOC: Entity teardown
+ *
+ * Drivers can tear down an entity for several reasons. Typical reasons are:
+ * the user closes the entity via an IOCTL, the FD associated with the entity
+ * is closed, or the entity encounters an error. The GPU scheduler provides the
+ * basic infrastructure to do this in a few different ways.
+ *
+ * 1. Let the entity run dry (both the pending list and job queue) and then call
+ * drm_sched_entity_fini. The backend can accelerate the process of running dry.
+ * For example set a flag so run_job is a NOP and set the TDR to a low value to
+ * signal all jobs in a timely manner (this example works for
+ * DRM_SCHED_POLICY_SINGLE_ENTITY).
+ *
+ * 2. Kill the entity directly via drm_sched_entity_flush /
+ * drm_sched_entity_fini ensuring all pending and queued jobs are off the
+ * hardware and signaled.
+ */
+
 #include <linux/kthread.h>
 #include <linux/slab.h>
 #include <linux/completion.h>
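
To make option 1 above more concrete, a hedged sketch of a backend
accelerating the run-dry path. Every my_* name is invented, and poking
sched.timeout directly is only one illustrative way to get a low TDR; the
exact mechanics are driver policy rather than anything this patch defines:

#include <linux/container_of.h>
#include <linux/dma-fence.h>
#include <linux/err.h>
#include <linux/jiffies.h>
#include <drm/gpu_scheduler.h>

/* Illustrative backend state; none of these names come from the patch. */
struct my_exec_queue {
        struct drm_gpu_scheduler sched;
        struct drm_sched_entity entity;
        bool killed;                    /* set once teardown has started */
};

/* Driver-specific hardware submission, elided in this sketch. */
static struct dma_fence *my_submit_to_hw(struct drm_sched_job *job)
{
        return ERR_PTR(-ENODEV);
}

/* Would be hooked up as drm_sched_backend_ops.run_job. */
static struct dma_fence *my_run_job(struct drm_sched_job *job)
{
        struct my_exec_queue *q =
                container_of(job->sched, struct my_exec_queue, sched);

        /* NOP once killed: never touch the hardware again. */
        if (READ_ONCE(q->killed))
                return ERR_PTR(-ECANCELED);

        return my_submit_to_hw(job);
}

static void my_kill_queue(struct my_exec_queue *q)
{
        WRITE_ONCE(q->killed, true);

        /* Lower the TDR so jobs already on the pending list are signaled
         * (or cleaned up) in a timely manner. */
        q->sched.timeout = msecs_to_jiffies(10);

        /* ... wait for the job queue and pending list to drain ... */

        drm_sched_entity_fini(&q->entity);
}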