Message ID | 20240129015053.1687418-1-airlied@gmail.com (mailing list archive)
---|---
State | New, archived
Series | nouveau: offload fence uevents work to workqueue
On 1/29/24 02:50, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
>
> This should break the deadlock between the fctx lock and the irq lock.
>
> This offloads the processing of the work from the irq into a workqueue.
>
> Signed-off-by: Dave Airlie <airlied@redhat.com>

Nouveau's scheduler uses a dedicated wq, hence from this perspective it's
safe deferring fence signalling to the kernel global wq. However, I wonder
if we could create deadlocks by building dependency chains into other
drivers / kernel code that, by chance, makes use of the kernel global wq
as well.

Admittedly, even if, it's gonna be extremely unlikely given that
WQ_MAX_ACTIVE == 512. But maybe it'd be safer to use a dedicated wq.

Also, do we need to CC stable?

> ---
>  drivers/gpu/drm/nouveau/nouveau_fence.c | 24 ++++++++++++++++++------
>  drivers/gpu/drm/nouveau/nouveau_fence.h |  1 +
>  2 files changed, 19 insertions(+), 6 deletions(-)
>
> [snip - the full diff is reproduced at the end of this page]
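For reference, the dedicated-wq alternative Danilo suggests could look roughly like the following sketch. This is hypothetical: `fence_wq` and both functions are illustration only and not part of the patch.

```c
/*
 * Hypothetical sketch only: queue the uevent work on a driver-private
 * workqueue instead of the kernel global one.
 */
#include <linux/errno.h>
#include <linux/workqueue.h>
#include "nouveau_fence.h"

static struct workqueue_struct *fence_wq;

static int my_fence_wq_init(void)
{
	/* Ordinary bound wq with the default max_active limit. */
	fence_wq = alloc_workqueue("nouveau-fence", 0, 0);
	return fence_wq ? 0 : -ENOMEM;
}

/* Instead of schedule_work(&fctx->uevent_work) in the event handler: */
static void my_queue_uevent(struct nouveau_fence_chan *fctx)
{
	queue_work(fence_wq, &fctx->uevent_work);
}
```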
On Tue, 6 Feb 2024 at 02:22, Danilo Krummrich <dakr@redhat.com> wrote:
> Nouveau's scheduler uses a dedicated wq, hence from this perspective it's
> safe deferring fence signalling to the kernel global wq. However, I wonder
> if we could create deadlocks by building dependency chains into other
> drivers / kernel code that, by chance, makes use of the kernel global wq
> as well.
>
> Admittedly, even if, it's gonna be extremely unlikely given that
> WQ_MAX_ACTIVE == 512. But maybe it'd be safer to use a dedicated wq.
>
> Also, do we need to CC stable?

I pushed this to Linus at the end of last week, since the hangs in 6.7
take out the complete system and I wanted it in stable.

It might be safer to use a dedicated wq, is the concern someone is
waiting on a fence in a workqueue somewhere else so we will never
signal it?

Dave.
On 2/5/24 22:08, Dave Airlie wrote:
> [...]
> It might be safer to use a dedicated wq, is the concern someone is
> waiting on a fence in a workqueue somewhere else so we will never
> signal it?

Yes, if some other work is waiting for this fence (or something else in
the same dependency chain) to signal, it can prevent execution of the work
signaling this fence, in case both are scheduled on the same wq. As
mentioned, with the kernel global wq this would be rather unlikely to lead
to an actual stall with WQ_MAX_ACTIVE == 512, but formally the race
condition exists. I guess a malicious attacker could try to intentionally
push jobs directly or indirectly depending on this fence to a driver which
queues them up on a scheduler using the kernel global wq.
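To make the stall pattern concrete, here is a minimal hypothetical sketch (none of these functions exist in nouveau): two work items on the same workqueue, where one blocks on a fence that only the other can signal. If enough waiters occupy the wq's max_active slots, the signaling work never gets a worker thread.

```c
/*
 * Hypothetical sketch of the stall: if enough works like wait_work_fn()
 * occupy the wq's max_active slots, signal_work_fn() never gets a worker,
 * and every dma_fence_wait() below blocks forever.
 */
#include <linux/dma-fence.h>
#include <linux/workqueue.h>

static void wait_work_fn(struct work_struct *work)
{
	struct dma_fence *fence = get_pending_fence();	/* hypothetical helper */

	dma_fence_wait(fence, false);	/* waits for signal_work_fn() */
	dma_fence_put(fence);
}

static void signal_work_fn(struct work_struct *work)
{
	struct dma_fence *fence = get_pending_fence();	/* hypothetical helper */

	dma_fence_signal(fence);	/* never runs if all workers block above */
	dma_fence_put(fence);
}
```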
On Mon, Feb 05, 2024 at 11:00:04PM +0100, Danilo Krummrich wrote:
> Yes, if some other work is waiting for this fence (or something else in
> the same dependency chain) to signal, it can prevent execution of the work
> signaling this fence, in case both are scheduled on the same wq. [...]

I think if you add dma_fence_signalling annotations (aside, there's some
patch from iirc Thomas Hellstrom to improve them and cut down on some
false positives, but I lost track), then I think you won't get any splats,
because the wq subsystem assumes that WQ_MAX_ACTIVE is close enough to
infinity to not matter.

I'm not sure we should care differently, but I guess it'd be good to
annotate it all in case the wq subsystem's idea of how real such deadlocks
are changes.

Also Tejun is on a mission to get rid of all the global wq flushes, so
there should also be no source of deadlocks from that kind of cross-driver
dependency. Or at least there shouldn't be in the future; I'm not sure it
all landed.
-Sima
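For context, the annotations Sima mentions are the dma_fence signalling critical-section markers from the dma-buf core; a minimal usage sketch, where the surrounding function is hypothetical:

```c
/*
 * Minimal usage sketch of the dma_fence signalling annotations: lockdep
 * treats everything between begin and end as fence signalling critical,
 * so acquiring a lock that elsewhere waits on a fence produces a splat.
 */
#include <linux/dma-fence.h>

static void my_signalling_path(struct dma_fence *fence)	/* hypothetical */
{
	bool cookie = dma_fence_begin_signalling();

	/* ... work that must finish for the fence to signal ... */

	dma_fence_signal(fence);
	dma_fence_end_signalling(cookie);
}
```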
On 2/6/24 15:03, Daniel Vetter wrote:
> I think if you add dma_fence_signalling annotations [...] then I think you
> won't get any splats, because the wq subsystem assumes that WQ_MAX_ACTIVE
> is close enough to infinity to not matter.

As mentioned, for the kernel global wq it's 512. (Intentionally) feeding
the kernel with enough jobs to provoke a deadlock doesn't seem impossible
to me.

I think it'd be safer to just establish not to use the kernel global wq
for executing work in the fence signalling critical path.

We could also run into similar problems with a dedicated wq, e.g. when
drivers share a wq between drm_gpu_scheduler instances (see [1]), however,
I'm not sure we can catch that with lockdep.

[1] https://elixir.bootlin.com/linux/v6.8-rc3/source/drivers/gpu/drm/nouveau/nouveau_drm.c#L313
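The sharing pattern referenced in [1] boils down to something like the following sketch, with hypothetical driver names rather than nouveau's exact code:

```c
/*
 * Sketch of the sharing pattern in [1] (hypothetical names): one
 * driver-wide wq handed to every scheduler instance, so the fence
 * signalling work of all contexts contends for the same max_active slots.
 */
#include <linux/errno.h>
#include <linux/workqueue.h>

struct my_drm_driver {
	struct workqueue_struct *sched_wq;
};

static int my_drm_driver_init(struct my_drm_driver *drm)
{
	drm->sched_wq = alloc_workqueue("my-sched-wq", 0, WQ_MAX_ACTIVE);
	if (!drm->sched_wq)
		return -ENOMEM;

	/*
	 * Every per-context drm_gpu_scheduler later gets drm->sched_wq
	 * passed in; none of them knows how many others share it.
	 */
	return 0;
}
```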
On Fri, Feb 09, 2024 at 06:41:32PM +0100, Danilo Krummrich wrote:
> I think it'd be safer to just establish not to use the kernel global wq
> for executing work in the fence signalling critical path.
>
> We could also run into similar problems with a dedicated wq, e.g. when
> drivers share a wq between drm_gpu_scheduler instances (see [1]), however,
> I'm not sure we can catch that with lockdep.

I think if you want to fix it perfectly you'd need to set the max number
of in-flight works of the wq to the number of engines (or, for dynamic/fw
scheduled engines, to the number of contexts) you have. Or whatever limit
to the number of parallel timelines there is.

I guess this would need a new wq function to update? drm/sched code could
be able to set that for drivers, so drivers cannot get this wrong.

If we don't do something like that then I'm not sure there's really much
benefit - instead of carefully timing 512 dma_fences on the global wq you
just need to create a pile of contexts (at least on Intel's GuC that's
absolutely no issue) and then carefully time them.

I still feel like we have bigger fish to fry ... But might be worth the
effort to at least make the parallel timeline limit a lot more explicit?

Cheers, Sima
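A sketch of what such a limit could look like with the existing workqueue API; `num_engines` is a hypothetical driver-provided value for the number of parallel timelines:

```c
/*
 * Sketch: clamp the scheduler wq's concurrency to the number of parallel
 * timelines, and update it when that number changes at runtime.
 */
#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *sched_wq;

static int my_sched_wq_init(unsigned int num_engines)
{
	sched_wq = alloc_workqueue("my-sched-wq", 0, num_engines);
	return sched_wq ? 0 : -ENOMEM;
}

/* For dynamic/fw-scheduled engines, the limit can be adjusted later: */
static void my_sched_wq_resize(unsigned int num_engines)
{
	workqueue_set_max_active(sched_wq, num_engines);
}
```

`workqueue_set_max_active()` already exists, so the "new wq function" might only be needed on the drm/sched side, to apply it on behalf of drivers.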
On 2/9/24 19:52, Daniel Vetter wrote:
> I think if you want to fix it perfectly you'd need to set the max number
> of in-flight works of the wq to the number of engines (or, for dynamic/fw
> scheduled engines, to the number of contexts) you have. Or whatever limit
> to the number of parallel timelines there is.
>
> I guess this would need a new wq function to update? drm/sched code could
> be able to set that for drivers, so drivers cannot get this wrong.

Not sure I can follow. The scheduler instance might be per context and
bind queue. In this case it gets the shared wq passed, but doesn't know
how many other scheduler instances share the same one.

Additionally, there might be drivers not using the DRM scheduler for bind
queues at all (I think Xe does not).

> If we don't do something like that then I'm not sure there's really much
> benefit - instead of carefully timing 512 dma_fences on the global wq you
> just need to create a pile of contexts (at least on Intel's GuC that's
> absolutely no issue) and then carefully time them.

Well, that's true. I'd still argue that there is a slight difference. From
a driver's isolated perspective, using the global kernel wq might be
entirely fine, as in this patch. However, in combination with another
driver doing the same thing, things can blow up. For the case you
illustrated it's at least possible to spot it from a driver's perspective.

> I still feel like we have bigger fish to fry ... But might be worth the
> effort to at least make the parallel timeline limit a lot more explicit?

I agree, and it'd be great if we could find a solution such that such bugs
can be detected systematically (e.g. through lockdep), but maybe we can
start by at least documenting that we should never use the kernel global
wq and where we need to be careful in sharing driver wqs.

- Danilo
On Tue, Feb 13, 2024 at 06:39:20PM +0100, Danilo Krummrich wrote:
> Not sure I can follow. The scheduler instance might be per context and
> bind queue. In this case it gets the shared wq passed, but doesn't know
> how many other scheduler instances share the same one.

Yeah, that's why maybe more of that logic should be in the drm/sched code
instead of drivers just cleverly using what's there ...

> Additionally, there might be drivers not using the DRM scheduler for bind
> queues at all (I think Xe does not).

Uh ... maybe we should do this the same across all drivers? But I also
thought that Xe was flat-out synchronous and only had an out-fence, since
you need a userspace thread in mesa for vk semantics anyway. If Xe
hand-rolled a scheduler I'm not going to be very amused.

> Well, that's true. I'd still argue that there is a slight difference. From
> a driver's isolated perspective, using the global kernel wq might be
> entirely fine, as in this patch. However, in combination with another
> driver doing the same thing, things can blow up. [...]
>
> I agree, and it'd be great if we could find a solution such that such bugs
> can be detected systematically (e.g. through lockdep), but maybe we can
> start by at least documenting that we should never use the kernel global
> wq and where we need to be careful in sharing driver wqs.

Yeah, I guess the above two are other reasons why maybe we need a bit more
structure in the scheduler APIs instead of just allowing drivers to hand
in shared wq pointers. Something like a struct drm_sched_domain, which
contains the wq plus a list of the drm_sched instances for it. Would also
make stuff like reliably stopping the right amount of schedulers in TDR
much more robust.
-Sima
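Sketching Sima's proposal as a data structure, purely hypothetically; nothing like this existed in drm/sched at the time of this thread:

```c
/*
 * Purely hypothetical sketch of the proposed drm_sched_domain: the wq and
 * the schedulers sharing it live in one structure, so drm/sched itself
 * knows how many timelines contend for the wq.
 */
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>

struct drm_sched_domain {
	struct workqueue_struct *wq;	/* shared submit wq */
	struct list_head schedulers;	/* drm_gpu_scheduler instances on it */
	unsigned int num_scheds;	/* explicit parallel timeline count */
	struct mutex lock;		/* protects schedulers/num_scheds */
};
```

Registering a scheduler through such a domain could then also adjust the wq's max_active, keeping the parallel timeline limit explicit rather than implicit in whatever the drivers happen to pass in.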
On 2/16/24 17:41, Daniel Vetter wrote:
> Yeah, that's why maybe more of that logic should be in the drm/sched code
> instead of drivers just cleverly using what's there ...

Agree, that's gonna be a huge design change though.

> Uh ... maybe we should do this the same across all drivers? But I also
> thought that Xe was flat-out synchronous and only had an out-fence, since
> you need a userspace thread in mesa for vk semantics anyway. If Xe
> hand-rolled a scheduler I'm not going to be very amused.

I really don't know the details, but there are similarities at least.
There is the rebind work, which seems to be called in some VM_BIND cases
and in the context of an EXEC ioctl, and seems to signal a fence. It seems
valid to not stuff this into the scheduler.

There are also cases like this one, where we have fence signalling
critical code in wqs outside the context of a scheduler instance.

> Yeah, I guess the above two are other reasons why maybe we need a bit more
> structure in the scheduler APIs instead of just allowing drivers to hand
> in shared wq pointers. Something like a struct drm_sched_domain, which
> contains the wq plus a list of the drm_sched instances for it. Would also
> make stuff like reliably stopping the right amount of schedulers in TDR
> much more robust.

Yeah, that sounds like a good starting point. I can also add a
corresponding entry to the DRM TODO list explaining the issue.

- Danilo
```diff
diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.c b/drivers/gpu/drm/nouveau/nouveau_fence.c
index ca762ea55413..93f08f9479d8 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.c
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.c
@@ -103,6 +103,7 @@ nouveau_fence_context_kill(struct nouveau_fence_chan *fctx, int error)
 void
 nouveau_fence_context_del(struct nouveau_fence_chan *fctx)
 {
+	cancel_work_sync(&fctx->uevent_work);
 	nouveau_fence_context_kill(fctx, 0);
 	nvif_event_dtor(&fctx->event);
 	fctx->dead = 1;
@@ -145,12 +146,13 @@ nouveau_fence_update(struct nouveau_channel *chan, struct nouveau_fence_chan *fc
 	return drop;
 }
 
-static int
-nouveau_fence_wait_uevent_handler(struct nvif_event *event, void *repv, u32 repc)
+static void
+nouveau_fence_uevent_work(struct work_struct *work)
 {
-	struct nouveau_fence_chan *fctx = container_of(event, typeof(*fctx), event);
+	struct nouveau_fence_chan *fctx = container_of(work, struct nouveau_fence_chan,
+						       uevent_work);
 	unsigned long flags;
-	int ret = NVIF_EVENT_KEEP;
+	int drop = 0;
 
 	spin_lock_irqsave(&fctx->lock, flags);
 	if (!list_empty(&fctx->pending)) {
@@ -160,11 +162,20 @@ nouveau_fence_wait_uevent_handler(struct nvif_event *event, void *repv, u32 repc
 		fence = list_entry(fctx->pending.next, typeof(*fence), head);
 		chan = rcu_dereference_protected(fence->channel, lockdep_is_held(&fctx->lock));
 		if (nouveau_fence_update(chan, fctx))
-			ret = NVIF_EVENT_DROP;
+			drop = 1;
 	}
+	if (drop)
+		nvif_event_block(&fctx->event);
+
 	spin_unlock_irqrestore(&fctx->lock, flags);
+}
 
-	return ret;
+static int
+nouveau_fence_wait_uevent_handler(struct nvif_event *event, void *repv, u32 repc)
+{
+	struct nouveau_fence_chan *fctx = container_of(event, typeof(*fctx), event);
+	schedule_work(&fctx->uevent_work);
+	return NVIF_EVENT_KEEP;
 }
 
 void
@@ -178,6 +189,7 @@ nouveau_fence_context_new(struct nouveau_channel *chan, struct nouveau_fence_cha
 	} args;
 	int ret;
 
+	INIT_WORK(&fctx->uevent_work, nouveau_fence_uevent_work);
 	INIT_LIST_HEAD(&fctx->flip);
 	INIT_LIST_HEAD(&fctx->pending);
 	spin_lock_init(&fctx->lock);
diff --git a/drivers/gpu/drm/nouveau/nouveau_fence.h b/drivers/gpu/drm/nouveau/nouveau_fence.h
index 64d33ae7f356..8bc065acfe35 100644
--- a/drivers/gpu/drm/nouveau/nouveau_fence.h
+++ b/drivers/gpu/drm/nouveau/nouveau_fence.h
@@ -44,6 +44,7 @@ struct nouveau_fence_chan {
 	u32 context;
 	char name[32];
 
+	struct work_struct uevent_work;
 	struct nvif_event event;
 	int notify_ref, dead, killed;
 };
```