Message ID | 20240913165326.8856-1-robdclark@gmail.com (mailing list archive)
---|---
State | New, archived
Series | drm/sched: Fix dynamic job-flow control race
On 2024-09-13 18:53, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
>
> Fixes a race condition reported here: https://github.com/AsahiLinux/linux/issues/309#issuecomment-2238968609
>
> The whole premise of lockless access to a single-producer-single-
> consumer queue is that there is just a single producer and single
> consumer. That means we can't call drm_sched_can_queue() (which is
> about queueing more work to the hw, not to the spsc queue) from
> anywhere other than the consumer (wq).
>
> This call in the producer is just an optimization to avoid scheduling
> the consuming worker if it cannot yet queue more work to the hw. It
> is safe to drop this optimization to avoid the race condition.
>
> Suggested-by: Asahi Lina <lina@asahilina.net>
> Fixes: a78422e9dff3 ("drm/sched: implement dynamic job-flow control")
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index ab53ab486fe6..1af1dbe757d5 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1020,8 +1020,7 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>  		      struct drm_sched_entity *entity)
>  {
> -	if (drm_sched_can_queue(sched, entity))
> -		drm_sched_run_job_queue(sched);
> +	drm_sched_run_job_queue(sched);
>  }
>
>  /**

The entity parameter is unused now.
On Fri, Sep 13, 2024 at 10:03 AM Michel Dänzer
<michel.daenzer@mailbox.org> wrote:
>
> On 2024-09-13 18:53, Rob Clark wrote:
> > From: Rob Clark <robdclark@chromium.org>
> >
> > Fixes a race condition reported here: https://github.com/AsahiLinux/linux/issues/309#issuecomment-2238968609
> >
> > The whole premise of lockless access to a single-producer-single-
> > consumer queue is that there is just a single producer and single
> > consumer. That means we can't call drm_sched_can_queue() (which is
> > about queueing more work to the hw, not to the spsc queue) from
> > anywhere other than the consumer (wq).
> >
> > This call in the producer is just an optimization to avoid scheduling
> > the consuming worker if it cannot yet queue more work to the hw. It
> > is safe to drop this optimization to avoid the race condition.
> >
> > Suggested-by: Asahi Lina <lina@asahilina.net>
> > Fixes: a78422e9dff3 ("drm/sched: implement dynamic job-flow control")
> > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index ab53ab486fe6..1af1dbe757d5 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1020,8 +1020,7 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
> >  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
> >  		      struct drm_sched_entity *entity)
> >  {
> > -	if (drm_sched_can_queue(sched, entity))
> > -		drm_sched_run_job_queue(sched);
> > +	drm_sched_run_job_queue(sched);
> >  }
> >
> >  /**
>
> The entity parameter is unused now.

Right.. and we probably should collapse drm_sched_wakeup() and
drm_sched_run_job_queue()..

But this fix needs to be cherry picked back to a bunch of release
branches, so I intentionally avoided refactoring as part of the fix.

BR,
-R

> --
> Earthling Michel Dänzer       \        GNOME / Xwayland / Mesa developer
> https://redhat.com             \         Libre software enthusiast
On Fri, Sep 13, 2024 at 09:53:25AM -0700, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
>
> Fixes a race condition reported here: https://github.com/AsahiLinux/linux/issues/309#issuecomment-2238968609

Good catch! Please add a 'Closes' tag with this link.

>
> The whole premise of lockless access to a single-producer-single-
> consumer queue is that there is just a single producer and single
> consumer. That means we can't call drm_sched_can_queue() (which is
> about queueing more work to the hw, not to the spsc queue) from
> anywhere other than the consumer (wq).
>
> This call in the producer is just an optimization to avoid scheduling
> the consuming worker if it cannot yet queue more work to the hw. It
> is safe to drop this optimization to avoid the race condition.
>
> Suggested-by: Asahi Lina <lina@asahilina.net>
> Fixes: a78422e9dff3 ("drm/sched: implement dynamic job-flow control")

You may want to explicitly CC stable.

> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index ab53ab486fe6..1af1dbe757d5 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1020,8 +1020,7 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>  		      struct drm_sched_entity *entity)

Please also remove the entity parameter.

For the other refactoring, I agree it should be in a different patch.

With that,

Reviewed-by: Danilo Krummrich <dakr@kernel.org>

Thanks for fixing this.

- Danilo

>  {
> -	if (drm_sched_can_queue(sched, entity))
> -		drm_sched_run_job_queue(sched);
> +	drm_sched_run_job_queue(sched);
>  }
>
>  /**
> --
> 2.46.0
>
On 2024-09-13 19:28, Rob Clark wrote:
> On Fri, Sep 13, 2024 at 10:03 AM Michel Dänzer
> <michel.daenzer@mailbox.org> wrote:
>>
>> On 2024-09-13 18:53, Rob Clark wrote:
>>> From: Rob Clark <robdclark@chromium.org>
>>>
>>> Fixes a race condition reported here: https://github.com/AsahiLinux/linux/issues/309#issuecomment-2238968609
>>>
>>> The whole premise of lockless access to a single-producer-single-
>>> consumer queue is that there is just a single producer and single
>>> consumer. That means we can't call drm_sched_can_queue() (which is
>>> about queueing more work to the hw, not to the spsc queue) from
>>> anywhere other than the consumer (wq).
>>>
>>> This call in the producer is just an optimization to avoid scheduling
>>> the consuming worker if it cannot yet queue more work to the hw. It
>>> is safe to drop this optimization to avoid the race condition.
>>>
>>> Suggested-by: Asahi Lina <lina@asahilina.net>
>>> Fixes: a78422e9dff3 ("drm/sched: implement dynamic job-flow control")
>>> Signed-off-by: Rob Clark <robdclark@chromium.org>
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_main.c | 3 +--
>>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index ab53ab486fe6..1af1dbe757d5 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -1020,8 +1020,7 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>>>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>>>  		      struct drm_sched_entity *entity)
>>>  {
>>> -	if (drm_sched_can_queue(sched, entity))
>>> -		drm_sched_run_job_queue(sched);
>>> +	drm_sched_run_job_queue(sched);
>>>  }
>>>
>>>  /**
>>
>> The entity parameter is unused now.
>
> Right.. and we probably should collapse drm_sched_wakeup() and
> drm_sched_run_job_queue()..

I looked into that as well, seems fine to leave them separate for now though.

> But this fix needs to be cherry picked back to a bunch of release
> branches, so I intentionally avoided refactoring as part of the fix.

Fixing up the drm_sched_wakeup caller(s) when backporting doesn't seem like a big deal.
On Fri, 2024-09-13 at 09:53 -0700, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
>
> Fixes a race condition reported here:
> https://github.com/AsahiLinux/linux/issues/309#issuecomment-2238968609

As Danilo suggested before, I'd put this in a Fixes: section at the
bottom and instead have a sentence here detailing what the race
consists of, i.e., who is racing with whom.

P.

>
> The whole premise of lockless access to a single-producer-single-
> consumer queue is that there is just a single producer and single
> consumer. That means we can't call drm_sched_can_queue() (which is
> about queueing more work to the hw, not to the spsc queue) from
> anywhere other than the consumer (wq).
>
> This call in the producer is just an optimization to avoid scheduling
> the consuming worker if it cannot yet queue more work to the hw. It
> is safe to drop this optimization to avoid the race condition.
>
> Suggested-by: Asahi Lina <lina@asahilina.net>
> Fixes: a78422e9dff3 ("drm/sched: implement dynamic job-flow control")
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> b/drivers/gpu/drm/scheduler/sched_main.c
> index ab53ab486fe6..1af1dbe757d5 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -1020,8 +1020,7 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
>  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
>  		      struct drm_sched_entity *entity)
>  {
> -	if (drm_sched_can_queue(sched, entity))
> -		drm_sched_run_job_queue(sched);
> +	drm_sched_run_job_queue(sched);
>  }
>
>  /**
On Mon, Sep 16, 2024 at 10:21:21AM +0200, Philipp Stanner wrote:
> On Fri, 2024-09-13 at 09:53 -0700, Rob Clark wrote:
> > From: Rob Clark <robdclark@chromium.org>
> >
> > Fixes a race condition reported here:
> > https://github.com/AsahiLinux/linux/issues/309#issuecomment-2238968609
>
> As Danilo suggested before, I'd put this in a Fixes: section at the

I think you mean 'Closes'.

Please note that there's a v2 [1] already, which does that.

[1] https://lore.kernel.org/dri-devel/20240913202301.16772-1-robdclark@gmail.com/

> bottom and instead have a sentence here detailing what the race
> consists of, i.e., who is racing with whom.

That's written right below, isn't it?

>
> P.
>
> >
> > The whole premise of lockless access to a single-producer-single-
> > consumer queue is that there is just a single producer and single
> > consumer. That means we can't call drm_sched_can_queue() (which is
> > about queueing more work to the hw, not to the spsc queue) from
> > anywhere other than the consumer (wq).
> >
> > This call in the producer is just an optimization to avoid scheduling
> > the consuming worker if it cannot yet queue more work to the hw. It
> > is safe to drop this optimization to avoid the race condition.
> >
> > Suggested-by: Asahi Lina <lina@asahilina.net>
> > Fixes: a78422e9dff3 ("drm/sched: implement dynamic job-flow control")
> > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > ---
> >  drivers/gpu/drm/scheduler/sched_main.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c
> > b/drivers/gpu/drm/scheduler/sched_main.c
> > index ab53ab486fe6..1af1dbe757d5 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -1020,8 +1020,7 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
> >  void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
> >  		      struct drm_sched_entity *entity)
> >  {
> > -	if (drm_sched_can_queue(sched, entity))
> > -		drm_sched_run_job_queue(sched);
> > +	drm_sched_run_job_queue(sched);
> >  }
> >
> >  /**
>
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index ab53ab486fe6..1af1dbe757d5 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1020,8 +1020,7 @@ EXPORT_SYMBOL(drm_sched_job_cleanup);
 void drm_sched_wakeup(struct drm_gpu_scheduler *sched,
 		      struct drm_sched_entity *entity)
 {
-	if (drm_sched_can_queue(sched, entity))
-		drm_sched_run_job_queue(sched);
+	drm_sched_run_job_queue(sched);
 }
 
 /**