
[RFC,0/4] DRM scheduler fixes, or not, or incorrect kind

Message ID 20240906180618.12180-1-tursulin@igalia.com

Message

Tvrtko Ursulin Sept. 6, 2024, 6:06 p.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>

In a recent conversation with Christian there was a thought that
drm_sched_entity_modify_sched() should start using the entity->rq_lock to be
safe against job submission and simultaneous priority changes.

The kerneldoc accompanying that function, however, is a bit unclear to me. For
instance, is amdgpu simply doing it wrongly by not serializing the two in the
driver? Or is the comment referring to some other race condition than the one
of concern in this series?

To cut a long story short, the first three patches try to fix this race in
three places where I *think* it can manifest in different ways.

The last patch is a trivial optimisation I spotted that can easily be done.

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Luben Tuikov <ltuikov89@gmail.com>
Cc: Matthew Brost <matthew.brost@intel.com>

Tvrtko Ursulin (4):
  drm/sched: Add locking to drm_sched_entity_modify_sched
  drm/sched: Always wake up correct scheduler in
    drm_sched_entity_push_job
  drm/sched: Always increment correct scheduler score
  drm/sched: Optimise drm_sched_entity_push_job

 drivers/gpu/drm/scheduler/sched_entity.c | 17 ++++++++++++-----
 drivers/gpu/drm/scheduler/sched_main.c   | 21 ++++++++++++++-------
 include/drm/gpu_scheduler.h              |  1 +
 3 files changed, 27 insertions(+), 12 deletions(-)

Comments

Philipp Stanner Sept. 9, 2024, 8:47 a.m. UTC | #1
Hi,

On Fri, 2024-09-06 at 19:06 +0100, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> 
> In a recent conversation with Christian there was a thought that
> drm_sched_entity_modify_sched() should start using the entity-
> >rq_lock to be
> safe against job submission and simultaneous priority changes.

There are also FIXMEs in gpu_scheduler.h that might be related.

> 
> The kerneldoc accompanying that function however is a bit unclear to
> me. For
> instance is amdgpu simply doing it wrongly by not serializing the two
> in the
> driver? Or is the comment referring to some other race condition than
> which is
> of concern in this series?
> 
> To cut the long story short, first three patches try to fix this race
> in three
> places I *think* can manifest in different ways.
> 
> Last patch is a trivial optimisation I spotted can be easily done.

I took a look and at least to me it doesn't appear to be that trivial,
mostly because it takes two locks.

Would you mind branching that out as a separate patch so that the
series would 100% address bugs?

P.	

> 
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Luben Tuikov <ltuikov89@gmail.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> 
> Tvrtko Ursulin (4):
>   drm/sched: Add locking to drm_sched_entity_modify_sched
>   drm/sched: Always wake up correct scheduler in
>     drm_sched_entity_push_job
>   drm/sched: Always increment correct scheduler score
>   drm/sched: Optimise drm_sched_entity_push_job
> 
>  drivers/gpu/drm/scheduler/sched_entity.c | 17 ++++++++++++-----
>  drivers/gpu/drm/scheduler/sched_main.c   | 21 ++++++++++++++-------
>  include/drm/gpu_scheduler.h              |  1 +
>  3 files changed, 27 insertions(+), 12 deletions(-)
>
Tvrtko Ursulin Sept. 9, 2024, 9:20 a.m. UTC | #2
On 09/09/2024 09:47, Philipp Stanner wrote:
> Hi,
> 
> On Fri, 2024-09-06 at 19:06 +0100, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
>>
>> In a recent conversation with Christian there was a thought that
>> drm_sched_entity_modify_sched() should start using the entity-
>>> rq_lock to be
>> safe against job submission and simultaneous priority changes.
> 
> There are also FIXMEs in gpu_scheduler.h that might be related.

Yes, there appear to be a good number of dodgy/unclear areas there.

>> The kerneldoc accompanying that function however is a bit unclear to
>> me. For
>> instance is amdgpu simply doing it wrongly by not serializing the two
>> in the
>> driver? Or is the comment referring to some other race condition than
>> which is
>> of concern in this series?
>>
>> To cut the long story short, first three patches try to fix this race
>> in three
>> places I *think* can manifest in different ways.
>>
>> Last patch is a trivial optimisation I spotted can be easily done.
> 
> I took a look and at least to me it doesn't appear to be that trivial,
> mostly because it takes two locks.

The code does take two locks, but the patch itself does not change any 
of that. It just splits the locked helper out so re-locking is avoided.

> Would you mind branching that out as a separate patch so that the
> series would 100% address bugs?

I wanted to start the series with the fixes so that backporting will work. If
I put the optimisation first then the fixes will not trivially apply to older
kernels. Assuming they are correct, of course.

Regards,

Tvrtko

> 
> P.	
> 
>>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Alex Deucher <alexander.deucher@amd.com>
>> Cc: Luben Tuikov <ltuikov89@gmail.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>>
>> Tvrtko Ursulin (4):
>>    drm/sched: Add locking to drm_sched_entity_modify_sched
>>    drm/sched: Always wake up correct scheduler in
>>      drm_sched_entity_push_job
>>    drm/sched: Always increment correct scheduler score
>>    drm/sched: Optimise drm_sched_entity_push_job
>>
>>   drivers/gpu/drm/scheduler/sched_entity.c | 17 ++++++++++++-----
>>   drivers/gpu/drm/scheduler/sched_main.c   | 21 ++++++++++++++-------
>>   include/drm/gpu_scheduler.h              |  1 +
>>   3 files changed, 27 insertions(+), 12 deletions(-)
>>
>
Matthew Brost Sept. 9, 2024, 4:08 p.m. UTC | #3
On Fri, Sep 06, 2024 at 07:06:14PM +0100, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> 
> In a recent conversation with Christian there was a thought that
> drm_sched_entity_modify_sched() should start using the entity->rq_lock to be
> safe against job submission and simultaneous priority changes.
> 
> The kerneldoc accompanying that function however is a bit unclear to me. For
> instance is amdgpu simply doing it wrongly by not serializing the two in the
> driver? Or is the comment referring to some other race condition than which is
> of concern in this series?
> 
> To cut the long story short, first three patches try to fix this race in three
> places I *think* can manifest in different ways.
> 
> Last patch is a trivial optimisation I spotted can be easily done.
> 
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Luben Tuikov <ltuikov89@gmail.com>
> Cc: Matthew Brost <matthew.brost@intel.com>

It seems like the series is getting traction in review, so I don't have much
to add; the series looks sane enough to me. Consider the series acked by me
on whatever you hash out in review.

Matt

> 
> Tvrtko Ursulin (4):
>   drm/sched: Add locking to drm_sched_entity_modify_sched
>   drm/sched: Always wake up correct scheduler in
>     drm_sched_entity_push_job
>   drm/sched: Always increment correct scheduler score
>   drm/sched: Optimise drm_sched_entity_push_job
> 
>  drivers/gpu/drm/scheduler/sched_entity.c | 17 ++++++++++++-----
>  drivers/gpu/drm/scheduler/sched_main.c   | 21 ++++++++++++++-------
>  include/drm/gpu_scheduler.h              |  1 +
>  3 files changed, 27 insertions(+), 12 deletions(-)
> 
> -- 
> 2.46.0
>