drm/nouveau: sched: fix leaking memory of timedout job

Message ID	20230916162835.5719-1-dakr@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <dri-devel-bounces@lists.freedesktop.org> From: Danilo Krummrich <dakr@redhat.com> To: nouveau@lists.freedesktop.org Subject: [PATCH] drm/nouveau: sched: fix leaking memory of timedout job Date: Sat, 16 Sep 2023 18:28:31 +0200 Message-ID: <20230916162835.5719-1-dakr@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset="US-ASCII"; x-default=true Precedence: list Cc: Danilo Krummrich <dakr@redhat.com>, dri-devel@lists.freedesktop.org, kherbst@redhat.com Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	drm/nouveau: sched: fix leaking memory of timedout job \| expand drm/nouveau: sched: fix leaking memory of timedout job

Message ID

20230916162835.5719-1-dakr@redhat.com (mailing list archive)

State

New, archived

Headers

From: Danilo Krummrich <dakr@redhat.com>
To: nouveau@lists.freedesktop.org
Subject: [PATCH] drm/nouveau: sched: fix leaking memory of timedout job
Date: Sat, 16 Sep 2023 18:28:31 +0200
Message-ID: <20230916162835.5719-1-dakr@redhat.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="US-ASCII"; x-default=true
Precedence: list
Cc: Danilo Krummrich <dakr@redhat.com>, dri-devel@lists.freedesktop.org,
 kherbst@redhat.com
Errors-To: dri-devel-bounces@lists.freedesktop.org
Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>

Series

drm/nouveau: sched: fix leaking memory of timedout job | expand

Commit Message

Danilo Krummrich Sept. 16, 2023, 4:28 p.m. UTC

Always stop and re-start the scheduler in order to let the scheduler
free up the timedout job in case it got signaled. In case of exec jobs
the job type specific callback will take care to signal all fences and
tear down the channel.

Fixes: b88baab82871 ("drm/nouveau: implement new VM_BIND uAPI")
Signed-off-by: Danilo Krummrich <dakr@redhat.com>
---
 drivers/gpu/drm/nouveau/nouveau_exec.c  |  2 +-
 drivers/gpu/drm/nouveau/nouveau_sched.c | 12 +++++++++---
 2 files changed, 10 insertions(+), 4 deletions(-)

Comments

Lyude Paul Sept. 18, 2023, 10:19 p.m. UTC | #1

BTW - Would you like me to review work like this? I'm totally happy to do
that, although I'm not terribly familiar with these parts of nouveau/drm (but
I'm always willing to learn, and would like to know more about these areas
anyway :)

…if the answer is yes, this patch looks fine to me so far - I guess the one
question I have that might have an obvious answer - how are jobs without an
job->ops->timeout callback cleaned up?

On Sat, 2023-09-16 at 18:28 +0200, Danilo Krummrich wrote:
> Always stop and re-start the scheduler in order to let the scheduler
> free up the timedout job in case it got signaled. In case of exec jobs
> the job type specific callback will take care to signal all fences and
> tear down the channel.
> 
> Fixes: b88baab82871 ("drm/nouveau: implement new VM_BIND uAPI")
> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
> ---
>  drivers/gpu/drm/nouveau/nouveau_exec.c  |  2 +-
>  drivers/gpu/drm/nouveau/nouveau_sched.c | 12 +++++++++---
>  2 files changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_exec.c b/drivers/gpu/drm/nouveau/nouveau_exec.c
> index 9c031d15fe0b..49d83ac9e036 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_exec.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_exec.c
> @@ -185,7 +185,7 @@ nouveau_exec_job_timeout(struct nouveau_job *job)
>  
>  	nouveau_sched_entity_fini(job->entity);
>  
> -	return DRM_GPU_SCHED_STAT_ENODEV;
> +	return DRM_GPU_SCHED_STAT_NOMINAL;
>  }
>  
>  static struct nouveau_job_ops nouveau_exec_job_ops = {
> diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
> index 88217185e0f3..3b7ea5221226 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_sched.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
> @@ -375,14 +375,20 @@ nouveau_sched_run_job(struct drm_sched_job *sched_job)
>  static enum drm_gpu_sched_stat
>  nouveau_sched_timedout_job(struct drm_sched_job *sched_job)
>  {
> +	struct drm_gpu_scheduler *sched = sched_job->sched;
>  	struct nouveau_job *job = to_nouveau_job(sched_job);
> +	enum drm_gpu_sched_stat stat = DRM_GPU_SCHED_STAT_NOMINAL;
>  
> -	NV_PRINTK(warn, job->cli, "Job timed out.\n");
> +	drm_sched_stop(sched, sched_job);
>  
>  	if (job->ops->timeout)
> -		return job->ops->timeout(job);
> +		stat = job->ops->timeout(job);
> +	else
> +		NV_PRINTK(warn, job->cli, "Generic job timeout.\n");
> +
> +	drm_sched_start(sched, true);
>  
> -	return DRM_GPU_SCHED_STAT_ENODEV;
> +	return stat;
>  }
>  
>  static void

Danilo Krummrich Sept. 18, 2023, 10:54 p.m. UTC | #2

On 9/19/23 00:19, Lyude Paul wrote:
> BTW - Would you like me to review work like this? I'm totally happy to do
> that, although I'm not terribly familiar with these parts of nouveau/drm (but
> I'm always willing to learn, and would like to know more about these areas
> anyway :)

Of course, that's absolutely welcome. :)

> 
> …if the answer is yes, this patch looks fine to me so far - I guess the one
> question I have that might have an obvious answer - how are jobs without an
> job->ops->timeout callback cleaned up?

Currently there are only two types of jobs, EXEC jobs and VM_BIND jobs.
EXEC jobs do have the callback, where on timeout the fence context is killed and
hence pending fences for this channel are signaled with an error code. Queued up
jobs from the channel's corresponding entity are discarded.

VM_BIND jobs still execute on the CPU and can't really time out. Well, technically, they
actually could, since drm_sched_job_begin() is called before the run_job() callback.
However, this would require mapping / unmapping stuff taking an absolutely unreasonable
amount of time. But even then it'd be safe because drm_sched_stop() would block until
the run_job() callback returns and subsequently the job would be cleaned up by
drm_sched_stop() since we'd pass it to drm_sched_stop() as 'bad' job.

However, I'm also working on a series [1] to make use of the DRM_SCHED_POLICY_SINGLE_ENTITY
work from Matt [2] in order to properly support the firmware scheduler.

[1] https://gitlab.freedesktop.org/nouvelles/kernel/-/commits/sched-single-entity/
[2] https://lore.kernel.org/dri-devel/20230912021615.2086698-1-matthew.brost@intel.com/

> 
> On Sat, 2023-09-16 at 18:28 +0200, Danilo Krummrich wrote:
>> Always stop and re-start the scheduler in order to let the scheduler
>> free up the timedout job in case it got signaled. In case of exec jobs
>> the job type specific callback will take care to signal all fences and
>> tear down the channel.
>>
>> Fixes: b88baab82871 ("drm/nouveau: implement new VM_BIND uAPI")
>> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
>> ---
>>   drivers/gpu/drm/nouveau/nouveau_exec.c  |  2 +-
>>   drivers/gpu/drm/nouveau/nouveau_sched.c | 12 +++++++++---
>>   2 files changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_exec.c b/drivers/gpu/drm/nouveau/nouveau_exec.c
>> index 9c031d15fe0b..49d83ac9e036 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_exec.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_exec.c
>> @@ -185,7 +185,7 @@ nouveau_exec_job_timeout(struct nouveau_job *job)
>>   
>>   	nouveau_sched_entity_fini(job->entity);
>>   
>> -	return DRM_GPU_SCHED_STAT_ENODEV;
>> +	return DRM_GPU_SCHED_STAT_NOMINAL;
>>   }
>>   
>>   static struct nouveau_job_ops nouveau_exec_job_ops = {
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
>> index 88217185e0f3..3b7ea5221226 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_sched.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
>> @@ -375,14 +375,20 @@ nouveau_sched_run_job(struct drm_sched_job *sched_job)
>>   static enum drm_gpu_sched_stat
>>   nouveau_sched_timedout_job(struct drm_sched_job *sched_job)
>>   {
>> +	struct drm_gpu_scheduler *sched = sched_job->sched;
>>   	struct nouveau_job *job = to_nouveau_job(sched_job);
>> +	enum drm_gpu_sched_stat stat = DRM_GPU_SCHED_STAT_NOMINAL;
>>   
>> -	NV_PRINTK(warn, job->cli, "Job timed out.\n");
>> +	drm_sched_stop(sched, sched_job);
>>   
>>   	if (job->ops->timeout)
>> -		return job->ops->timeout(job);
>> +		stat = job->ops->timeout(job);
>> +	else
>> +		NV_PRINTK(warn, job->cli, "Generic job timeout.\n");
>> +
>> +	drm_sched_start(sched, true);
>>   
>> -	return DRM_GPU_SCHED_STAT_ENODEV;
>> +	return stat;
>>   }
>>   
>>   static void
>

Dave Airlie Sept. 19, 2023, 2:13 a.m. UTC | #3

On Sun, 17 Sept 2023 at 02:28, Danilo Krummrich <dakr@redhat.com> wrote:
>
> Always stop and re-start the scheduler in order to let the scheduler
> free up the timedout job in case it got signaled. In case of exec jobs
> the job type specific callback will take care to signal all fences and
> tear down the channel.

Reviewed-by: Dave Airlie <airlied@redhat.com>
>
> Fixes: b88baab82871 ("drm/nouveau: implement new VM_BIND uAPI")
> Signed-off-by: Danilo Krummrich <dakr@redhat.com>
> ---
>  drivers/gpu/drm/nouveau/nouveau_exec.c  |  2 +-
>  drivers/gpu/drm/nouveau/nouveau_sched.c | 12 +++++++++---
>  2 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_exec.c b/drivers/gpu/drm/nouveau/nouveau_exec.c
> index 9c031d15fe0b..49d83ac9e036 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_exec.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_exec.c
> @@ -185,7 +185,7 @@ nouveau_exec_job_timeout(struct nouveau_job *job)
>
>         nouveau_sched_entity_fini(job->entity);
>
> -       return DRM_GPU_SCHED_STAT_ENODEV;
> +       return DRM_GPU_SCHED_STAT_NOMINAL;
>  }
>
>  static struct nouveau_job_ops nouveau_exec_job_ops = {
> diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
> index 88217185e0f3..3b7ea5221226 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_sched.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
> @@ -375,14 +375,20 @@ nouveau_sched_run_job(struct drm_sched_job *sched_job)
>  static enum drm_gpu_sched_stat
>  nouveau_sched_timedout_job(struct drm_sched_job *sched_job)
>  {
> +       struct drm_gpu_scheduler *sched = sched_job->sched;
>         struct nouveau_job *job = to_nouveau_job(sched_job);
> +       enum drm_gpu_sched_stat stat = DRM_GPU_SCHED_STAT_NOMINAL;
>
> -       NV_PRINTK(warn, job->cli, "Job timed out.\n");
> +       drm_sched_stop(sched, sched_job);
>
>         if (job->ops->timeout)
> -               return job->ops->timeout(job);
> +               stat = job->ops->timeout(job);
> +       else
> +               NV_PRINTK(warn, job->cli, "Generic job timeout.\n");
> +
> +       drm_sched_start(sched, true);
>
> -       return DRM_GPU_SCHED_STAT_ENODEV;
> +       return stat;
>  }
>
>  static void
> --
> 2.41.0
>

diff --git a/drivers/gpu/drm/nouveau/nouveau_exec.c b/drivers/gpu/drm/nouveau/nouveau_exec.c
index 9c031d15fe0b..49d83ac9e036 100644
--- a/drivers/gpu/drm/nouveau/nouveau_exec.c
+++ b/drivers/gpu/drm/nouveau/nouveau_exec.c
@@ -185,7 +185,7 @@  nouveau_exec_job_timeout(struct nouveau_job *job)
 
 	nouveau_sched_entity_fini(job->entity);
 
-	return DRM_GPU_SCHED_STAT_ENODEV;
+	return DRM_GPU_SCHED_STAT_NOMINAL;
 }
 
 static struct nouveau_job_ops nouveau_exec_job_ops = {
diff --git a/drivers/gpu/drm/nouveau/nouveau_sched.c b/drivers/gpu/drm/nouveau/nouveau_sched.c
index 88217185e0f3..3b7ea5221226 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sched.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sched.c
@@ -375,14 +375,20 @@  nouveau_sched_run_job(struct drm_sched_job *sched_job)
 static enum drm_gpu_sched_stat
 nouveau_sched_timedout_job(struct drm_sched_job *sched_job)
 {
+	struct drm_gpu_scheduler *sched = sched_job->sched;
 	struct nouveau_job *job = to_nouveau_job(sched_job);
+	enum drm_gpu_sched_stat stat = DRM_GPU_SCHED_STAT_NOMINAL;
 
-	NV_PRINTK(warn, job->cli, "Job timed out.\n");
+	drm_sched_stop(sched, sched_job);
 
 	if (job->ops->timeout)
-		return job->ops->timeout(job);
+		stat = job->ops->timeout(job);
+	else
+		NV_PRINTK(warn, job->cli, "Generic job timeout.\n");
+
+	drm_sched_start(sched, true);
 
-	return DRM_GPU_SCHED_STAT_ENODEV;
+	return stat;
 }
 
 static void

drm/nouveau: sched: fix leaking memory of timedout job

Commit Message

Comments

Patch