From patchwork Fri Dec 4 03:17:22 2020
X-Patchwork-Submitter: Luben Tuikov
X-Patchwork-Id: 11950579
From: Luben Tuikov
To: amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
Subject: [PATCH 5/5] drm/sched: Make use of a "done" list (v2)
Date: Thu, 3 Dec 2020 22:17:22 -0500
Message-Id: <20201204031722.24040-6-luben.tuikov@amd.com>
In-Reply-To: <20201204031722.24040-1-luben.tuikov@amd.com>
References: <20201204031722.24040-1-luben.tuikov@amd.com>
MIME-Version: 1.0
Cc: Tomeu Vizoso, Daniel Vetter, Alyssa Rosenzweig, Steven Price,
    Luben Tuikov, Qiang Yu, Russell King, Alexander Deucher,
    Christian König

The drm_sched_job_done() callback now moves done jobs from the pending
list to a "done" list. In drm_sched_job_timedout(), make use of the
status returned by a GPU driver's job timeout handler to decide whether
to leave the oldest job in the pending list, or to send it off to the
done list. If a driver's job timeout callback returns a status
indicating that the job is done, it is added to the done list and the
main scheduler thread is woken up. If the job needs more time, it is
left on the pending list and the timeout timer is restarted.

The idea is that a GPU driver can check the IP to which the passed-in
job belongs and determine whether the IP is alive and well, or whether
it needs more time to complete this job, and perhaps others also
executing on it.

In drm_sched_job_timedout(), the main scheduler thread is now parked
before calling a driver's timedout_job callback, so that it does not
compete by pushing jobs down to the GPU while the recovery method is
taking place.

Eliminate the polling mechanism of picking out done jobs from the
pending list, i.e. eliminate drm_sched_get_cleanup_job(). This also
prevents the oldest job from disappearing from the pending list while
the driver's timeout handler is called.
Various other optimizations to the GPU scheduler and job recovery are
possible with this format.

Signed-off-by: Luben Tuikov
Cc: Alexander Deucher
Cc: Andrey Grodzovsky
Cc: Christian König
Cc: Daniel Vetter
Cc: Lucas Stach
Cc: Russell King
Cc: Christian Gmeiner
Cc: Qiang Yu
Cc: Rob Herring
Cc: Tomeu Vizoso
Cc: Steven Price
Cc: Alyssa Rosenzweig
Cc: Eric Anholt

v2: Dispense with the done thread, so as to keep the cache hot on the
    same processor.
---
 drivers/gpu/drm/scheduler/sched_main.c | 247 +++++++++++++------------
 include/drm/gpu_scheduler.h            |   4 +
 2 files changed, 134 insertions(+), 117 deletions(-)

diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index b9876cad94f2..d77180b44998 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -164,7 +164,9 @@ drm_sched_rq_select_entity(struct drm_sched_rq *rq)
  * drm_sched_job_done - complete a job
  * @s_job: pointer to the job which is done
  *
- * Finish the job's fence and wake up the worker thread.
+ * Move the completed task to the done list,
+ * signal its fence to mark it finished,
+ * and wake up the worker thread.
  */
 static void drm_sched_job_done(struct drm_sched_job *s_job)
 {
@@ -176,9 +178,14 @@ static void drm_sched_job_done(struct drm_sched_job *s_job)
 
 	trace_drm_sched_process_job(s_fence);
 
+	spin_lock(&sched->job_list_lock);
+	list_move(&s_job->list, &sched->done_list);
+	spin_unlock(&sched->job_list_lock);
+
 	dma_fence_get(&s_fence->finished);
 	drm_sched_fence_finished(s_fence);
 	dma_fence_put(&s_fence->finished);
+
 	wake_up_interruptible(&sched->wake_up_worker);
 }
 
@@ -309,6 +316,37 @@ static void drm_sched_job_begin(struct drm_sched_job *s_job)
 	spin_unlock(&sched->job_list_lock);
 }
 
+/** drm_sched_job_timedout -- a timer timeout occurred
+ * @work: pointer to work_struct
+ *
+ * First, park the scheduler thread whose IP timed out,
+ * so that we don't race with the scheduler thread pushing
+ * jobs down the IP as we try to investigate what
+ * happened and give drivers a chance to recover.
+ *
+ * Second, take the first job in the pending list
+ * (oldest), leave it in the pending list and call the
+ * driver's timer timeout callback to find out what
+ * happened, passing this job as the suspect one.
+ *
+ * The driver may return DRM_TASK_STATUS_COMPLETE,
+ * which means the task is not in the IP(*) and we move
+ * it to the done list to free it.
+ *
+ * (*) A reason for this would be, say, that the job
+ *     completed in due time, or the driver has aborted
+ *     this job using driver-specific methods in the
+ *     timedout_job callback and has now removed it from
+ *     the hardware.
+ *
+ * Or, the driver may return DRM_TASK_STATUS_ALIVE, to
+ * indicate that it has inquired about this job, has
+ * verified that this job is alive and well, and that
+ * the DRM layer should give this task more time
+ * to complete. In this case, we restart the timeout timer.
+ *
+ * Lastly, we unpark the scheduler thread.
+ */
 static void drm_sched_job_timedout(struct work_struct *work)
 {
 	struct drm_gpu_scheduler *sched;
@@ -316,37 +354,32 @@ static void drm_sched_job_timedout(struct work_struct *work)
 
 	sched = container_of(work, struct drm_gpu_scheduler, work_tdr.work);
 
-	/* Protects against concurrent deletion in drm_sched_get_cleanup_job */
+	kthread_park(sched->thread);
+
 	spin_lock(&sched->job_list_lock);
 	job = list_first_entry_or_null(&sched->pending_list,
 				       struct drm_sched_job, list);
+	spin_unlock(&sched->job_list_lock);
 
 	if (job) {
-		/*
-		 * Remove the bad job so it cannot be freed by concurrent
-		 * drm_sched_cleanup_jobs. It will be reinserted back after sched->thread
-		 * is parked at which point it's safe.
-		 */
-		list_del_init(&job->list);
-		spin_unlock(&sched->job_list_lock);
-
-		job->sched->ops->timedout_job(job);
+		int res;
 
-		/*
-		 * Guilty job did complete and hence needs to be manually removed
-		 * See drm_sched_stop doc.
-		 */
-		if (sched->free_guilty) {
-			job->sched->ops->free_job(job);
-			sched->free_guilty = false;
+		res = job->sched->ops->timedout_job(job);
+		if (res == DRM_TASK_STATUS_COMPLETE) {
+			/* The job is out of the device.
+			 */
+			spin_lock(&sched->job_list_lock);
+			list_move(&job->list, &sched->done_list);
+			spin_unlock(&sched->job_list_lock);
+			wake_up_interruptible(&sched->wake_up_worker);
+		} else {
+			/* The job needs more time.
+			 */
+			drm_sched_start_timeout(sched);
 		}
-	} else {
-		spin_unlock(&sched->job_list_lock);
 	}
 
-	spin_lock(&sched->job_list_lock);
-	drm_sched_start_timeout(sched);
-	spin_unlock(&sched->job_list_lock);
+	kthread_unpark(sched->thread);
 }
 
 /**
@@ -413,24 +446,13 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 	kthread_park(sched->thread);
 
 	/*
-	 * Reinsert back the bad job here - now it's safe as
-	 * drm_sched_get_cleanup_job cannot race against us and release the
-	 * bad job at this point - we parked (waited for) any in progress
-	 * (earlier) cleanups and drm_sched_get_cleanup_job will not be called
-	 * now until the scheduler thread is unparked.
-	 */
-	if (bad && bad->sched == sched)
-		/*
-		 * Add at the head of the queue to reflect it was the earliest
-		 * job extracted.
-		 */
-		list_add(&bad->list, &sched->pending_list);
-
-	/*
-	 * Iterate the job list from later to earlier one and either deactive
-	 * their HW callbacks or remove them from pending list if they already
-	 * signaled.
-	 * This iteration is thread safe as sched thread is stopped.
+	 * Iterate the pending list in reverse order,
+	 * from most recently submitted to oldest
+	 * tasks. Tasks which haven't completed, leave
+	 * them in the pending list, but decrement
+	 * their hardware run queue count.
+	 * Else, the fence must've signalled, and the job
+	 * is in the done list.
 	 */
 	list_for_each_entry_safe_reverse(s_job, tmp,
 					 &sched->pending_list, list) {
@@ -439,36 +461,52 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
 				      &s_job->cb)) {
 			atomic_dec(&sched->hw_rq_count);
 		} else {
-			/*
-			 * remove job from pending_list.
-			 * Locking here is for concurrent resume timeout
-			 */
-			spin_lock(&sched->job_list_lock);
-			list_del_init(&s_job->list);
-			spin_unlock(&sched->job_list_lock);
-
-			/*
-			 * Wait for job's HW fence callback to finish using s_job
-			 * before releasing it.
-			 *
-			 * Job is still alive so fence refcount at least 1
-			 */
-			dma_fence_wait(&s_job->s_fence->finished, false);
-
-			/*
-			 * We must keep bad job alive for later use during
-			 * recovery by some of the drivers but leave a hint
-			 * that the guilty job must be released.
-			 */
-			if (bad != s_job)
-				sched->ops->free_job(s_job);
-			else
-				sched->free_guilty = true;
+			if (bad == s_job) {
+				/* This is the oldest job on the pending list
+				 * whose IP timed out. The
+				 * drm_sched_job_timedout() function calls the
+				 * driver's timedout_job callback passing @bad,
+				 * who then calls this function here--as such
+				 * we shouldn't move @bad or free it. This will
+				 * be decided by drm_sched_job_timedout() when
+				 * this function here returns back to the caller
+				 * (the driver) and the driver's timedout_job
+				 * callback returns a result to
+				 * drm_sched_job_timedout().
+				 */
+				;
+			} else {
+				int res;
+
+				/* This job is not the @bad job passed above.
+				 * Note that perhaps it was *this* job which
+				 * timed out. The wait below is suspect. Since
+				 * it waits with maximum timeout and "intr" set
+				 * to false, it will either return 0, indicating
+				 * that the fence has signalled, or negative on
+				 * error. What if the whole IP is stuck and
+				 * this ends up waiting forever?
+				 *
+				 * Wait for job's HW fence callback to finish
+				 * using s_job before releasing it.
+				 *
+				 * Job is still alive so fence
+				 * refcount at least 1
+				 */
+				res = dma_fence_wait(&s_job->s_fence->finished,
+						     false);
+
+				if (res == 0)
+					sched->ops->free_job(s_job);
+				else
+					pr_err_once("%s: dma_fence_wait: %d\n",
+						    sched->name, res);
+			}
 		}
 	}
 
 	/*
-	 * Stop pending timer in flight as we rearm it in drm_sched_start. This 
+	 * Stop pending timer in flight as we rearm it in drm_sched_start. This
 	 * avoids the pending timeout work in progress to fire right away after
 	 * this TDR finished and before the newly restarted jobs had a
 	 * chance to complete.
@@ -511,8 +549,9 @@ void drm_sched_start(struct drm_gpu_scheduler *sched, bool full_recovery)
 			else if (r)
 				DRM_ERROR("fence add callback failed (%d)\n",
 					  r);
-		} else
+		} else {
 			drm_sched_job_done(s_job);
+		}
 	}
 
 	if (full_recovery) {
@@ -665,47 +704,6 @@ drm_sched_select_entity(struct drm_gpu_scheduler *sched)
 	return entity;
 }
 
-/**
- * drm_sched_get_cleanup_job - fetch the next finished job to be destroyed
- *
- * @sched: scheduler instance
- *
- * Returns the next finished job from the pending list (if there is one)
- * ready for it to be destroyed.
- */
-static struct drm_sched_job *
-drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
-{
-	struct drm_sched_job *job;
-
-	/*
-	 * Don't destroy jobs while the timeout worker is running OR thread
-	 * is being parked and hence assumed to not touch pending_list
-	 */
-	if ((sched->timeout != MAX_SCHEDULE_TIMEOUT &&
-	    !cancel_delayed_work(&sched->work_tdr)) ||
-	    kthread_should_park())
-		return NULL;
-
-	spin_lock(&sched->job_list_lock);
-
-	job = list_first_entry_or_null(&sched->pending_list,
-				       struct drm_sched_job, list);
-
-	if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
-		/* remove job from pending_list */
-		list_del_init(&job->list);
-	} else {
-		job = NULL;
-		/* queue timeout for next job */
-		drm_sched_start_timeout(sched);
-	}
-
-	spin_unlock(&sched->job_list_lock);
-
-	return job;
-}
-
 /**
  * drm_sched_pick_best - Get a drm sched from a sched_list with the least load
  * @sched_list: list of drm_gpu_schedulers
@@ -759,6 +757,25 @@ static bool drm_sched_blocked(struct drm_gpu_scheduler *sched)
 	return false;
 }
 
+static void drm_sched_free_done(struct drm_gpu_scheduler *sched)
+{
+	LIST_HEAD(done_q);
+
+	spin_lock(&sched->job_list_lock);
+	list_splice_init(&sched->done_list, &done_q);
+	spin_unlock(&sched->job_list_lock);
+
+	while (!list_empty(&done_q)) {
+		struct drm_sched_job *job;
+
+		job = list_first_entry(&done_q,
+				       struct drm_sched_job,
+				       list);
+		list_del_init(&job->list);
+		sched->ops->free_job(job);
+	}
+}
+
 /**
  * drm_sched_main - main scheduler thread
  *
@@ -768,7 +785,7 @@ static bool drm_sched_blocked(struct drm_gpu_scheduler *sched)
  */
 static int drm_sched_main(void *param)
 {
-	struct drm_gpu_scheduler *sched = (struct drm_gpu_scheduler *)param;
+	struct drm_gpu_scheduler *sched = param;
 	int r;
 
 	sched_set_fifo_low(current);
@@ -778,19 +795,14 @@ static int drm_sched_main(void *param)
 		struct drm_sched_fence *s_fence;
 		struct drm_sched_job *sched_job;
 		struct dma_fence *fence;
-		struct drm_sched_job *cleanup_job = NULL;
 
 		wait_event_interruptible(sched->wake_up_worker,
-					 (cleanup_job = drm_sched_get_cleanup_job(sched)) ||
+					 (!list_empty(&sched->done_list)) ||
 					 (!drm_sched_blocked(sched) &&
 					  (entity = drm_sched_select_entity(sched))) ||
 					 kthread_should_stop());
 
-		if (cleanup_job) {
-			sched->ops->free_job(cleanup_job);
-			/* queue timeout for next job */
-			drm_sched_start_timeout(sched);
-		}
+		drm_sched_free_done(sched);
 
 		if (!entity)
 			continue;
@@ -864,6 +876,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
 	init_waitqueue_head(&sched->wake_up_worker);
 	init_waitqueue_head(&sched->job_scheduled);
 	INIT_LIST_HEAD(&sched->pending_list);
+	INIT_LIST_HEAD(&sched->done_list);
 	spin_lock_init(&sched->job_list_lock);
 	atomic_set(&sched->hw_rq_count, 0);
 	INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index cedfc5394e52..11278695fed0 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -289,6 +289,7 @@ struct drm_gpu_scheduler {
 	uint32_t			hw_submission_limit;
 	long				timeout;
 	const char			*name;
+
 	struct drm_sched_rq		sched_rq[DRM_SCHED_PRIORITY_COUNT];
 	wait_queue_head_t		wake_up_worker;
 	wait_queue_head_t		job_scheduled;
@@ -296,8 +297,11 @@ struct drm_gpu_scheduler {
 	atomic64_t			job_id_count;
 	struct delayed_work		work_tdr;
 	struct task_struct		*thread;
+
 	struct list_head		pending_list;
+	struct list_head		done_list;
 	spinlock_t			job_list_lock;
+
 	int				hang_limit;
 	atomic_t			score;
 	bool				ready;