Message ID: 20230615024008.1600281-1-airlied@gmail.com
State:      New, archived
Series:     nouveau: fix client work fence deletion race
On Thu, Jun 15, 2023 at 4:47 AM Dave Airlie <airlied@gmail.com> wrote:
>
> From: Dave Airlie <airlied@redhat.com>
>
> This seems to have existed forever but is now more apparent after
> 9bff18d13473a9fdf81d5158248472a9d8ecf2bd (drm/ttm: use per BO cleanup workers).
>
> My analysis:
> Two threads are running.
>
> One is in the irq signalling the fence, in dma_fence_signal_timestamp_locked;
> it has set DMA_FENCE_FLAG_SIGNALLED_BIT but hasn't yet reached the callbacks.
>
> The second thread is in nouveau_cli_work_ready, where it sees the fence is
> signalled, so it puts the fence, cleans up the object, and frees the work
> item, which contains the callback.
>
> Thread one resumes, tries to call the callback, and causes the use-after-free.
>
> Proposed fix:
> Lock the fence signalled check in nouveau_cli_work_ready, so that either the
> callbacks are done or the memory is freed.
>
> Signed-off-by: Dave Airlie <airlied@redhat.com>

Fixes: 11e451e74050 ("drm/nouveau: remove fence wait code from deferred client work handler")

is, I think, the most reasonable commit to tag to ensure backports.

> ---
>  drivers/gpu/drm/nouveau/nouveau_drm.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
> index cc7c5b4a05fd..1a45be769848 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_drm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
> @@ -137,10 +137,16 @@ nouveau_name(struct drm_device *dev)
>  static inline bool
>  nouveau_cli_work_ready(struct dma_fence *fence)
>  {
> -	if (!dma_fence_is_signaled(fence))
> -		return false;
> -	dma_fence_put(fence);
> -	return true;
> +	unsigned long flags;
> +	bool ret = true;
> +	spin_lock_irqsave(fence->lock, flags);
> +	if (!dma_fence_is_signaled_locked(fence))
> +		ret = false;
> +	spin_unlock_irqrestore(fence->lock, flags);

I agree with the code being always broken, as dma_fence_is_signaled_locked
specifies: "This function requires &dma_fence.lock to be held."

> +
> +	if (ret == true)
> +		dma_fence_put(fence);
> +	return ret;
>  }
>
>  static void
> --
> 2.40.1

Regardless of the patch fixing the issue users have seen:

Reviewed-by: Karol Herbst <kherbst@redhat.com>
On Thursday, 2023-06-15 at 12:40 +1000, Dave Airlie wrote:
> From: Dave Airlie <airlied@redhat.com>
>
> This seems to have existed forever but is now more apparent after
> 9bff18d13473a9fdf81d5158248472a9d8ecf2bd (drm/ttm: use per BO cleanup workers).
>
> My analysis:
> Two threads are running.
>
> One is in the irq signalling the fence, in dma_fence_signal_timestamp_locked;
> it has set DMA_FENCE_FLAG_SIGNALLED_BIT but hasn't yet reached the callbacks.
>
> The second thread is in nouveau_cli_work_ready, where it sees the fence is
> signalled, so it puts the fence, cleans up the object, and frees the work
> item, which contains the callback.
>
> Thread one resumes, tries to call the callback, and causes the use-after-free.
>
> Proposed fix:
> Lock the fence signalled check in nouveau_cli_work_ready, so that either the
> callbacks are done or the memory is freed.
>
> Signed-off-by: Dave Airlie <airlied@redhat.com>
> ---
>  drivers/gpu/drm/nouveau/nouveau_drm.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
> index cc7c5b4a05fd..1a45be769848 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_drm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
> @@ -137,10 +137,16 @@ nouveau_name(struct drm_device *dev)
>  static inline bool
>  nouveau_cli_work_ready(struct dma_fence *fence)
>  {
> -	if (!dma_fence_is_signaled(fence))
> -		return false;
> -	dma_fence_put(fence);
> -	return true;
> +	unsigned long flags;
> +	bool ret = true;
> +	spin_lock_irqsave(fence->lock, flags);

This function is only ever called from worker (process) context, so there is
no point in saving the current irq state. This can be a plain spin_lock_irq.

Regards,
Lucas

> +	if (!dma_fence_is_signaled_locked(fence))
> +		ret = false;
> +	spin_unlock_irqrestore(fence->lock, flags);
> +
> +	if (ret == true)
> +		dma_fence_put(fence);
> +	return ret;
>  }
>
>  static void
diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index cc7c5b4a05fd..1a45be769848 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -137,10 +137,16 @@ nouveau_name(struct drm_device *dev)
 static inline bool
 nouveau_cli_work_ready(struct dma_fence *fence)
 {
-	if (!dma_fence_is_signaled(fence))
-		return false;
-	dma_fence_put(fence);
-	return true;
+	unsigned long flags;
+	bool ret = true;
+	spin_lock_irqsave(fence->lock, flags);
+	if (!dma_fence_is_signaled_locked(fence))
+		ret = false;
+	spin_unlock_irqrestore(fence->lock, flags);
+
+	if (ret == true)
+		dma_fence_put(fence);
+	return ret;
 }
 
 static void