drm/sched: Add error code parameter to drm_sched_start

Message ID	20240725164343.250074-1-vitaly.prosyak@amd.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <dri-devel-bounces@lists.freedesktop.org> Received-SPF: Pass (protection.outlook.com: domain of amd.com designates 165.204.84.17 as permitted sender) receiver=protection.outlook.com; client-ip=165.204.84.17; helo=SATLEXMB04.amd.com; pr=C From: <vitaly.prosyak@amd.com> To: <amd-gfx@lists.freedesktop.org>, <dri-devel@lists.freedesktop.org>, <alexander.deucher@amd.com>, <christian.koenig@amd.com>, <jesse.zhang@amd.com> CC: Vitaly Prosyak <vitaly.prosyak@amd.com>, Jesse Zhang <Jesse.Zhang@amd.com> Subject: [PATCH] drm/sched: Add error code parameter to drm_sched_start Date: Thu, 25 Jul 2024 12:43:43 -0400 Message-ID: <20240725164343.250074-1-vitaly.prosyak@amd.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain Received-SPF: None (SATLEXMB04.amd.com: vitaly.prosyak@amd.com does not designate permitted sender hosts) Precedence: list Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	drm/sched: Add error code parameter to drm_sched_start \| expand drm/sched: Add error code parameter to drm_sched_start

Message ID

20240725164343.250074-1-vitaly.prosyak@amd.com (mailing list archive)

State

New, archived

Headers

Received-SPF: Pass (protection.outlook.com: domain of amd.com designates
 165.204.84.17 as permitted sender) receiver=protection.outlook.com;
 client-ip=165.204.84.17; helo=SATLEXMB04.amd.com; pr=C
From: <vitaly.prosyak@amd.com>
To: <amd-gfx@lists.freedesktop.org>, <dri-devel@lists.freedesktop.org>,
 <alexander.deucher@amd.com>, <christian.koenig@amd.com>,
 <jesse.zhang@amd.com>
CC: Vitaly Prosyak <vitaly.prosyak@amd.com>, Jesse Zhang <Jesse.Zhang@amd.com>
Subject: [PATCH] drm/sched: Add error code parameter to drm_sched_start
Date: Thu, 25 Jul 2024 12:43:43 -0400
Message-ID: <20240725164343.250074-1-vitaly.prosyak@amd.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain
Received-SPF: None (SATLEXMB04.amd.com: vitaly.prosyak@amd.com does not
 designate permitted sender hosts)
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 25 Jul 2024 16:43:53.2275 (UTC)
X-MS-Exchange-CrossTenant-Network-Message-Id: 
 0eea1d51-d45e-476d-ee4c-08dcacc8f7ef
X-MS-Exchange-CrossTenant-Id: 3dd8961f-e488-4e60-8e11-a82d994e183d
X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: 
 TenantId=3dd8961f-e488-4e60-8e11-a82d994e183d; Ip=[165.204.84.17];
 Helo=[SATLEXMB04.amd.com]
X-MS-Exchange-CrossTenant-AuthSource: 
 BN1PEPF0000468E.namprd05.prod.outlook.com
X-MS-Exchange-CrossTenant-AuthAs: Anonymous
X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem
X-MS-Exchange-Transport-CrossTenantHeadersStamped: SJ1PR12MB6122
X-BeenThere: dri-devel@lists.freedesktop.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Direct Rendering Infrastructure - Development
 <dri-devel.lists.freedesktop.org>
List-Unsubscribe: <https://lists.freedesktop.org/mailman/options/dri-devel>,
 <mailto:dri-devel-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <https://lists.freedesktop.org/archives/dri-devel>
List-Post: <mailto:dri-devel@lists.freedesktop.org>
List-Help: <mailto:dri-devel-request@lists.freedesktop.org?subject=help>
List-Subscribe: <https://lists.freedesktop.org/mailman/listinfo/dri-devel>,
 <mailto:dri-devel-request@lists.freedesktop.org?subject=subscribe>
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>

Series

drm/sched: Add error code parameter to drm_sched_start | expand

Commit Message

Vitaly Prosyak July 25, 2024, 4:43 p.m. UTC

From: Vitaly Prosyak <vitaly.prosyak@amd.com>

The current implementation of drm_sched_start uses a hardcoded -ECANCELED to dispose of a job when
the parent/hw fence is NULL. This results in drm_sched_job_done being called with -ECANCELED for
each job with a NULL parent in the pending list, making it difficult to distinguish between recovery
methods, whether a queue reset or a full GPU reset was used.

To improve this, we first try a soft recovery for timeout jobs and use the error code -ENODATA.
If soft recovery fails, we proceed with a queue reset, where the error code remains -ENODATA for
the job. Finally, for a full GPU reset, we use error codes -ECANCELED or -ETIME. This patch adds
an error code parameter to drm_sched_start, allowing us to differentiate between queue reset and
GPU reset failures. This enables user mode and test applications to validate the expected
correctness of the requested operation. After a successful queue reset, the only way to continue
normal operation is to call drm_sched_job_done with the specific error code -ENODATA.

v1: Initial implementation by Jesse utilized amdgpu_device_lock_reset_domain and
    amdgpu_device_unlock_reset_domain to allow user mode to track the queue reset status and distinguish
    between queue reset and GPU reset.
v2: Christian suggested using the error codes -ENODATA for queue reset and -ECANCELED or -ETIME for GPU
    reset, returned to amdgpu_cs_wait_ioctl.
v3: To meet the requirements, we introduce a new function drm_sched_start_ex with an additional parameter
    to set dma_fence_set_error, allowing us to handle the specific error codes appropriately and dispose
    of bad jobs with the selected error code depending on whether it was a queue reset or GPU reset.
v4: Alex suggested using a new name, drm_sched_start_with_recovery_error, which more accurately describes
    the function's purpose. Additionally, it was recommended to add documentation details about the new method.
v5: Fixed declaration of new function drm_sched_start_with_recovery_error.(Alex)

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Signed-off-by: Jesse Zhang <Jesse.Zhang@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
---
 drivers/gpu/drm/scheduler/sched_main.c | 30 +++++++++++++++++++++++---
 include/drm/gpu_scheduler.h            |  1 +
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 7e90c9f95611..c42449358b3f 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -671,13 +671,24 @@  void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 EXPORT_SYMBOL(drm_sched_stop);
 
 /**
- * drm_sched_start - recover jobs after a reset
+ * drm_sched_start_with_recovery_error - recover jobs after a reset with
+ * custom error
  *
  * @sched: scheduler instance
  * @full_recovery: proceed with complete sched restart
+ * @error : err code for set dma_fence_set_error
+ *
+ * Starts the scheduler and allows setting a custom dma_fence_set_error,
+ * which can be used to identify the recovery mechanism actually used.
  *
+ * For example:
+ * - If a soft or queue reset was used, dma_fence_set_error is set to -ENODATA.
+ * - If an entire GPU reset was used, the error code is set to -ECANCELED.
+ *
+ * This approach enables user mode and test applications to know which
+ * recovery method was used for a given bad job.
  */
-void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
+void drm_sched_start_with_recovery_error(struct drm_gpu_scheduler *sched, bool full_recovery, int error)
 {
 	struct drm_sched_job *s_job, *tmp;
 	int r;
@@ -704,7 +715,7 @@  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 				DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n",
 					  r);
 		} else
-			drm_sched_job_done(s_job, -ECANCELED);
+			drm_sched_job_done(s_job, error);
 	}
 
 	if (full_recovery)
@@ -712,6 +723,19 @@  void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 
 	drm_sched_wqueue_start(sched);
 }
+EXPORT_SYMBOL(drm_sched_start_with_recovery_error);
+
+/**
+ * drm_sched_start - recover jobs after a reset
+ *
+ * @sched: scheduler instance
+ * @full_recovery: proceed with complete sched restart
+ *
+ */
+void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
+{
+	drm_sched_start_with_recovery_error(sched, full_recovery, -ECANCELED);
+}
 EXPORT_SYMBOL(drm_sched_start);
 
 /**
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 5acc64954a88..e80ea947a864 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -580,6 +580,7 @@  void drm_sched_wqueue_stop(struct drm_gpu_scheduler *sched);
 void drm_sched_wqueue_start(struct drm_gpu_scheduler *sched);
 void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad);
 void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery);
+void drm_sched_start_with_recovery_error(struct drm_gpu_scheduler *sched, bool full_recovery, int error);
 void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched);
 void drm_sched_increase_karma(struct drm_sched_job *bad);
 void drm_sched_reset_karma(struct drm_sched_job *bad);

drm/sched: Add error code parameter to drm_sched_start

Commit Message

Patch