[RFC,01/29] dma-fence: Add dma_fence_preempt base class

Message ID	20241118233757.2374041-2-matthew.brost@intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <dri-devel-bounces@lists.freedesktop.org> From: Matthew Brost <matthew.brost@intel.com> To: intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org Cc: kenneth.w.graunke@intel.com, lionel.g.landwerlin@intel.com, jose.souza@intel.com, simona.vetter@ffwll.ch, thomas.hellstrom@linux.intel.com, boris.brezillon@collabora.com, airlied@gmail.com, christian.koenig@amd.com, mihail.atanassov@arm.com, steven.price@arm.com, shashank.sharma@amd.com Subject: [RFC PATCH 01/29] dma-fence: Add dma_fence_preempt base class Date: Mon, 18 Nov 2024 15:37:29 -0800 Message-Id: <20241118233757.2374041-2-matthew.brost@intel.com> In-Reply-To: <20241118233757.2374041-1-matthew.brost@intel.com> References: <20241118233757.2374041-1-matthew.brost@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Precedence: list Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	UMD direct submission in Xe \| expand [RFC,00/29] UMD direct submission in Xe [RFC,01/29] dma-fence: Add dma_fence_preempt base class [RFC,02/29] dma-fence: Add dma_fence_user_fence [RFC,03/29] drm/xe: Use dma_fence_preempt base class [RFC,04/29] drm/xe: Allocate doorbells for UMD exec queues [RFC,05/29] drm/xe: Add doorbell ID to snapshot capture [RFC,06/29] drm/xe: Break submission ring out into its own BO [RFC,07/29] drm/xe: Break indirect ring state out into its own BO [RFC,08/29] drm/xe: Clear GGTT in xe_bo_restore_kernel [RFC,09/29] FIXME: drm/xe: Add pad to ring and indirect state [RFC,10/29] drm/xe: Enable indirect ring on media GT [RFC,11/29] drm/xe: Don't add pinned mappings to VM bulk move [RFC,12/29] drm/xe: Add exec queue post init extension processing [RFC,13/29] drm/xe/mmap: Add mmap support for PCI memory barrier [RFC,14/29] drm/xe: Add support for mmapping doorbells to user space [RFC,15/29] drm/xe: Add support for mmapping submission ring and indirect ring state to user space [RFC,16/29] drm/xe/uapi: Define UMD exec queue mapping uAPI [RFC,17/29] drm/xe: Add usermap exec queue extension [RFC,18/29] drm/xe: Drop EXEC_QUEUE_FLAG_UMD_SUBMISSION flag [RFC,19/29] drm/xe: Do not allow usermap exec queues in exec IOCTL [RFC,20/29] drm/xe: Teach GuC backend to kill usermap queues [RFC,21/29] drm/xe: Enable preempt fences on usermap queues [RFC,22/29] drm/xe/uapi: Add uAPI to convert user semaphore to / from drm syncobj [RFC,23/29] drm/xe: Add user fence IRQ handler [RFC,24/29] drm/xe: Add xe_hw_fence_user_init [RFC,25/29] drm/xe: Add a message lock to the Xe GPU scheduler [RFC,26/29] drm/xe: Always wait on preempt fences in vma_check_userptr [RFC,27/29] drm/xe: Teach xe_sync layer about drm_xe_semaphore [RFC,28/29] drm/xe: Add VM convert fence IOCTL [RFC,29/29] drm/xe: Add user fence TDR

Matthew Brost Nov. 18, 2024, 11:37 p.m. UTC

Add a dma_fence_preempt base class with driver ops to implement
preemption, based on the existing Xe preemptive fence implementation.

Annotated to ensure correct driver usage.

Cc: Dave Airlie <airlied@redhat.com>
Cc: Simona Vetter <simona.vetter@ffwll.ch>
Cc: Christian Koenig <christian.koenig@amd.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/dma-buf/Makefile            |   2 +-
 drivers/dma-buf/dma-fence-preempt.c | 133 ++++++++++++++++++++++++++++
 include/linux/dma-fence-preempt.h   |  56 ++++++++++++
 3 files changed, 190 insertions(+), 1 deletion(-)
 create mode 100644 drivers/dma-buf/dma-fence-preempt.c
 create mode 100644 include/linux/dma-fence-preempt.h

Christian König Nov. 20, 2024, 1:31 p.m. UTC | #1

Am 19.11.24 um 00:37 schrieb Matthew Brost:
> Add a dma_fence_preempt base class with driver ops to implement
> preemption, based on the existing Xe preemptive fence implementation.
>
> Annotated to ensure correct driver usage.
>
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: Simona Vetter <simona.vetter@ffwll.ch>
> Cc: Christian Koenig <christian.koenig@amd.com>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/dma-buf/Makefile            |   2 +-
>   drivers/dma-buf/dma-fence-preempt.c | 133 ++++++++++++++++++++++++++++
>   include/linux/dma-fence-preempt.h   |  56 ++++++++++++
>   3 files changed, 190 insertions(+), 1 deletion(-)
>   create mode 100644 drivers/dma-buf/dma-fence-preempt.c
>   create mode 100644 include/linux/dma-fence-preempt.h
>
> diff --git a/drivers/dma-buf/Makefile b/drivers/dma-buf/Makefile
> index 70ec901edf2c..c25500bb38b5 100644
> --- a/drivers/dma-buf/Makefile
> +++ b/drivers/dma-buf/Makefile
> @@ -1,6 +1,6 @@
>   # SPDX-License-Identifier: GPL-2.0-only
>   obj-y := dma-buf.o dma-fence.o dma-fence-array.o dma-fence-chain.o \
> -	 dma-fence-unwrap.o dma-resv.o
> +	 dma-fence-preempt.o dma-fence-unwrap.o dma-resv.o
>   obj-$(CONFIG_DMABUF_HEAPS)	+= dma-heap.o
>   obj-$(CONFIG_DMABUF_HEAPS)	+= heaps/
>   obj-$(CONFIG_SYNC_FILE)		+= sync_file.o
> diff --git a/drivers/dma-buf/dma-fence-preempt.c b/drivers/dma-buf/dma-fence-preempt.c
> new file mode 100644
> index 000000000000..6e6ce7ea7421
> --- /dev/null
> +++ b/drivers/dma-buf/dma-fence-preempt.c
> @@ -0,0 +1,133 @@
> +// SPDX-License-Identifier: MIT
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#include <linux/dma-fence-preempt.h>
> +#include <linux/dma-resv.h>
> +
> +static void dma_fence_preempt_work_func(struct work_struct *w)
> +{
> +	bool cookie = dma_fence_begin_signalling();
> +	struct dma_fence_preempt *pfence =
> +		container_of(w, typeof(*pfence), work);
> +	const struct dma_fence_preempt_ops *ops = pfence->ops;
> +	int err = pfence->base.error;
> +
> +	if (!err) {
> +		err = ops->preempt_wait(pfence);
> +		if (err)
> +			dma_fence_set_error(&pfence->base, err);
> +	}
> +
> +	dma_fence_signal(&pfence->base);
> +	ops->preempt_finished(pfence);

Why is that callback useful?

> +
> +	dma_fence_end_signalling(cookie);
> +}
> +
> +static const char *
> +dma_fence_preempt_get_driver_name(struct dma_fence *fence)
> +{
> +	return "dma_fence_preempt";
> +}
> +
> +static const char *
> +dma_fence_preempt_get_timeline_name(struct dma_fence *fence)
> +{
> +	return "ordered";
> +}
> +
> +static void dma_fence_preempt_issue(struct dma_fence_preempt *pfence)
> +{
> +	int err;
> +
> +	err = pfence->ops->preempt(pfence);
> +	if (err)
> +		dma_fence_set_error(&pfence->base, err);
> +
> +	queue_work(pfence->wq, &pfence->work);
> +}
> +
> +static void dma_fence_preempt_cb(struct dma_fence *fence,
> +				 struct dma_fence_cb *cb)
> +{
> +	struct dma_fence_preempt *pfence =
> +		container_of(cb, typeof(*pfence), cb);
> +
> +	dma_fence_preempt_issue(pfence);
> +}
> +
> +static void dma_fence_preempt_delay(struct dma_fence_preempt *pfence)
> +{
> +	struct dma_fence *fence;
> +	int err;
> +
> +	fence = pfence->ops->preempt_delay(pfence);

Mhm, why is that useful?

> +	if (WARN_ON_ONCE(!fence || IS_ERR(fence)))
> +		return;
> +
> +	err = dma_fence_add_callback(fence, &pfence->cb, dma_fence_preempt_cb);

You are running into the exactly same bug we had :)

The problem here is that you can't call dma_fence_add_callback() from 
the enable_signaling callback. Background is that the 
fence_ops->enable_signaling callback is called with the spinlock of the 
preemption fence held.

This spinlock can be the same as the one of the user fence, but it could 
also be a different one. Either way calling dma_fence_add_callback() 
would let lockdep print a nice warning.

I tried to solve this by changing the dma_fence code to not call 
enable_signaling with the lock held, we wanted to do that anyway to 
prevent a bunch of issues with driver unload. But I realized that 
getting this upstream would take to long.

Long story short we moved handling the user fence into the work item.

Apart from that looks rather solid to me.

Regards,
Christian.

> +	if (err == -ENOENT)
> +		dma_fence_preempt_issue(pfence);
> +}
> +
> +static bool dma_fence_preempt_enable_signaling(struct dma_fence *fence)
> +{
> +	struct dma_fence_preempt *pfence =
> +		container_of(fence, typeof(*pfence), base);
> +
> +	if (pfence->ops->preempt_delay)
> +		dma_fence_preempt_delay(pfence);
> +	else
> +		dma_fence_preempt_issue(pfence);
> +
> +	return true;
> +}
> +
> +static const struct dma_fence_ops preempt_fence_ops = {
> +	.get_driver_name = dma_fence_preempt_get_driver_name,
> +	.get_timeline_name = dma_fence_preempt_get_timeline_name,
> +	.enable_signaling = dma_fence_preempt_enable_signaling,
> +};
> +
> +/**
> + * dma_fence_is_preempt() - Is preempt fence
> + *
> + * @fence: Preempt fence
> + *
> + * Return: True if preempt fence, False otherwise
> + */
> +bool dma_fence_is_preempt(const struct dma_fence *fence)
> +{
> +	return fence->ops == &preempt_fence_ops;
> +}
> +EXPORT_SYMBOL(dma_fence_is_preempt);
> +
> +/**
> + * dma_fence_preempt_init() - Initial preempt fence
> + *
> + * @fence: Preempt fence
> + * @ops: Preempt fence operations
> + * @wq: Work queue for preempt wait, should have WQ_MEM_RECLAIM set
> + * @context: Fence context
> + * @seqno: Fence seqence number
> + */
> +void dma_fence_preempt_init(struct dma_fence_preempt *fence,
> +			    const struct dma_fence_preempt_ops *ops,
> +			    struct workqueue_struct *wq,
> +			    u64 context, u64 seqno)
> +{
> +	/*
> +	 * XXX: We really want to check wq for WQ_MEM_RECLAIM here but
> +	 * workqueue_struct is private.
> +	 */
> +
> +	fence->ops = ops;
> +	fence->wq = wq;
> +	INIT_WORK(&fence->work, dma_fence_preempt_work_func);
> +	spin_lock_init(&fence->lock);
> +	dma_fence_init(&fence->base, &preempt_fence_ops,
> +		       &fence->lock, context, seqno);
> +}
> +EXPORT_SYMBOL(dma_fence_preempt_init);
> diff --git a/include/linux/dma-fence-preempt.h b/include/linux/dma-fence-preempt.h
> new file mode 100644
> index 000000000000..28d803f89527
> --- /dev/null
> +++ b/include/linux/dma-fence-preempt.h
> @@ -0,0 +1,56 @@
> +/* SPDX-License-Identifier: MIT */
> +/*
> + * Copyright © 2024 Intel Corporation
> + */
> +
> +#ifndef __LINUX_DMA_FENCE_PREEMPT_H
> +#define __LINUX_DMA_FENCE_PREEMPT_H
> +
> +#include <linux/dma-fence.h>
> +#include <linux/workqueue.h>
> +
> +struct dma_fence_preempt;
> +struct dma_resv;
> +
> +/**
> + * struct dma_fence_preempt_ops - Preempt fence operations
> + *
> + * These functions should be implemented in the driver side.
> + */
> +struct dma_fence_preempt_ops {
> +	/** @preempt_delay: Preempt execution with a delay */
> +	struct dma_fence *(*preempt_delay)(struct dma_fence_preempt *fence);
> +	/** @preempt: Preempt execution */
> +	int (*preempt)(struct dma_fence_preempt *fence);
> +	/** @preempt_wait: Wait for preempt of execution to complete */
> +	int (*preempt_wait)(struct dma_fence_preempt *fence);
> +	/** @preempt_finished: Signal that the preempt has finished */
> +	void (*preempt_finished)(struct dma_fence_preempt *fence);
> +};
> +
> +/**
> + * struct dma_fence_preempt - Embedded preempt fence base class
> + */
> +struct dma_fence_preempt {
> +	/** @base: Fence base class */
> +	struct dma_fence base;
> +	/** @lock: Spinlock for fence handling */
> +	spinlock_t lock;
> +	/** @cb: Callback preempt delay */
> +	struct dma_fence_cb cb;
> +	/** @ops: Preempt fence operation */
> +	const struct dma_fence_preempt_ops *ops;
> +	/** @wq: Work queue for preempt wait */
> +	struct workqueue_struct *wq;
> +	/** @work: Work struct for preempt wait */
> +	struct work_struct work;
> +};
> +
> +bool dma_fence_is_preempt(const struct dma_fence *fence);
> +
> +void dma_fence_preempt_init(struct dma_fence_preempt *fence,
> +			    const struct dma_fence_preempt_ops *ops,
> +			    struct workqueue_struct *wq,
> +			    u64 context, u64 seqno);
> +
> +#endif

Matthew Brost Nov. 20, 2024, 5:36 p.m. UTC | #2

On Wed, Nov 20, 2024 at 02:31:50PM +0100, Christian König wrote:
> Am 19.11.24 um 00:37 schrieb Matthew Brost:
> > Add a dma_fence_preempt base class with driver ops to implement
> > preemption, based on the existing Xe preemptive fence implementation.
> > 
> > Annotated to ensure correct driver usage.
> > 
> > Cc: Dave Airlie <airlied@redhat.com>
> > Cc: Simona Vetter <simona.vetter@ffwll.ch>
> > Cc: Christian Koenig <christian.koenig@amd.com>
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> >   drivers/dma-buf/Makefile            |   2 +-
> >   drivers/dma-buf/dma-fence-preempt.c | 133 ++++++++++++++++++++++++++++
> >   include/linux/dma-fence-preempt.h   |  56 ++++++++++++
> >   3 files changed, 190 insertions(+), 1 deletion(-)
> >   create mode 100644 drivers/dma-buf/dma-fence-preempt.c
> >   create mode 100644 include/linux/dma-fence-preempt.h
> > 
> > diff --git a/drivers/dma-buf/Makefile b/drivers/dma-buf/Makefile
> > index 70ec901edf2c..c25500bb38b5 100644
> > --- a/drivers/dma-buf/Makefile
> > +++ b/drivers/dma-buf/Makefile
> > @@ -1,6 +1,6 @@
> >   # SPDX-License-Identifier: GPL-2.0-only
> >   obj-y := dma-buf.o dma-fence.o dma-fence-array.o dma-fence-chain.o \
> > -	 dma-fence-unwrap.o dma-resv.o
> > +	 dma-fence-preempt.o dma-fence-unwrap.o dma-resv.o
> >   obj-$(CONFIG_DMABUF_HEAPS)	+= dma-heap.o
> >   obj-$(CONFIG_DMABUF_HEAPS)	+= heaps/
> >   obj-$(CONFIG_SYNC_FILE)		+= sync_file.o
> > diff --git a/drivers/dma-buf/dma-fence-preempt.c b/drivers/dma-buf/dma-fence-preempt.c
> > new file mode 100644
> > index 000000000000..6e6ce7ea7421
> > --- /dev/null
> > +++ b/drivers/dma-buf/dma-fence-preempt.c
> > @@ -0,0 +1,133 @@
> > +// SPDX-License-Identifier: MIT
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#include <linux/dma-fence-preempt.h>
> > +#include <linux/dma-resv.h>
> > +
> > +static void dma_fence_preempt_work_func(struct work_struct *w)
> > +{
> > +	bool cookie = dma_fence_begin_signalling();
> > +	struct dma_fence_preempt *pfence =
> > +		container_of(w, typeof(*pfence), work);
> > +	const struct dma_fence_preempt_ops *ops = pfence->ops;
> > +	int err = pfence->base.error;
> > +
> > +	if (!err) {
> > +		err = ops->preempt_wait(pfence);
> > +		if (err)
> > +			dma_fence_set_error(&pfence->base, err);
> > +	}
> > +
> > +	dma_fence_signal(&pfence->base);
> > +	ops->preempt_finished(pfence);
> 
> Why is that callback useful?
> 

In Xe, this is where we kick the resume worker and drop a ref to the
preemption object, which in Xe is an individual queue, and in AMD it is
a VM, right? wrt preemption object, I've reasoned this should work for
an either per queue or VM driver design of preempt fences.

This part likely could be moved into the preempt_wait callback though
but would get a little goofy in the error case if preempt_wait is not
called as the driver side would still need to cleanup a ref. Maybe I
don't even need a ref though - have to think that through - but for
general safety we typically like to take refs whenever a fence
references a different object.

> > +
> > +	dma_fence_end_signalling(cookie);
> > +}
> > +
> > +static const char *
> > +dma_fence_preempt_get_driver_name(struct dma_fence *fence)
> > +{
> > +	return "dma_fence_preempt";
> > +}
> > +
> > +static const char *
> > +dma_fence_preempt_get_timeline_name(struct dma_fence *fence)
> > +{
> > +	return "ordered";
> > +}
> > +
> > +static void dma_fence_preempt_issue(struct dma_fence_preempt *pfence)
> > +{
> > +	int err;
> > +
> > +	err = pfence->ops->preempt(pfence);
> > +	if (err)
> > +		dma_fence_set_error(&pfence->base, err);
> > +
> > +	queue_work(pfence->wq, &pfence->work);
> > +}
> > +
> > +static void dma_fence_preempt_cb(struct dma_fence *fence,
> > +				 struct dma_fence_cb *cb)
> > +{
> > +	struct dma_fence_preempt *pfence =
> > +		container_of(cb, typeof(*pfence), cb);
> > +
> > +	dma_fence_preempt_issue(pfence);
> > +}
> > +
> > +static void dma_fence_preempt_delay(struct dma_fence_preempt *pfence)
> > +{
> > +	struct dma_fence *fence;
> > +	int err;
> > +
> > +	fence = pfence->ops->preempt_delay(pfence);
> 
> Mhm, why is that useful?
> 

This for attaching the preempt object's last exported fence which needs
to be signaled before the preemption is issued. So for purely long
running VM's, this function could be NULL. For VM's with user queues +
dma fences, the driver returns the last fence from the convert user
fence to dma-fence IOCTL.

I realized my kernel doc doesn't explain this as well as it should, I
have already made this more verbose locally and hopefully it clearly
explains all of this.

> > +	if (WARN_ON_ONCE(!fence || IS_ERR(fence)))
> > +		return;
> > +
> > +	err = dma_fence_add_callback(fence, &pfence->cb, dma_fence_preempt_cb);
> 
> You are running into the exactly same bug we had :)
> 
> The problem here is that you can't call dma_fence_add_callback() from the
> enable_signaling callback. Background is that the
> fence_ops->enable_signaling callback is called with the spinlock of the
> preemption fence held.
> 
> This spinlock can be the same as the one of the user fence, but it could
> also be a different one. Either way calling dma_fence_add_callback() would
> let lockdep print a nice warning.
> 

Hmm, I see the problem if you share a lock between the preempt fence and
last exported fence but as long as these locks are seperate I don't see
the issue.

The locking order then is:

preempt fence lock -> last exported fence lock.

Lockdep does not explode in Xe but maybe can buy this is a little
unsafe. We could always move preempt_delay to the worker, attach a CB,
and rekick the worker upon the last fence signaling if you think that is
safer. Of course we could always just directly wait on the returned last
fence in the worker too.

> I tried to solve this by changing the dma_fence code to not call
> enable_signaling with the lock held, we wanted to do that anyway to prevent
> a bunch of issues with driver unload. But I realized that getting this
> upstream would take to long.
> 
> Long story short we moved handling the user fence into the work item.
> 

I did run into an issue when trying to make preempt_wait after return a
fence + attach a CB, and signal this preempt fence from the CB. I got
locking inversions almost worked through them but eventually gave up and
stuck with the worker.

Matt 

> Apart from that looks rather solid to me.
> 
> Regards,
> Christian.
> 
> > +	if (err == -ENOENT)
> > +		dma_fence_preempt_issue(pfence);
> > +}
> > +
> > +static bool dma_fence_preempt_enable_signaling(struct dma_fence *fence)
> > +{
> > +	struct dma_fence_preempt *pfence =
> > +		container_of(fence, typeof(*pfence), base);
> > +
> > +	if (pfence->ops->preempt_delay)
> > +		dma_fence_preempt_delay(pfence);
> > +	else
> > +		dma_fence_preempt_issue(pfence);
> > +
> > +	return true;
> > +}
> > +
> > +static const struct dma_fence_ops preempt_fence_ops = {
> > +	.get_driver_name = dma_fence_preempt_get_driver_name,
> > +	.get_timeline_name = dma_fence_preempt_get_timeline_name,
> > +	.enable_signaling = dma_fence_preempt_enable_signaling,
> > +};
> > +
> > +/**
> > + * dma_fence_is_preempt() - Is preempt fence
> > + *
> > + * @fence: Preempt fence
> > + *
> > + * Return: True if preempt fence, False otherwise
> > + */
> > +bool dma_fence_is_preempt(const struct dma_fence *fence)
> > +{
> > +	return fence->ops == &preempt_fence_ops;
> > +}
> > +EXPORT_SYMBOL(dma_fence_is_preempt);
> > +
> > +/**
> > + * dma_fence_preempt_init() - Initial preempt fence
> > + *
> > + * @fence: Preempt fence
> > + * @ops: Preempt fence operations
> > + * @wq: Work queue for preempt wait, should have WQ_MEM_RECLAIM set
> > + * @context: Fence context
> > + * @seqno: Fence seqence number
> > + */
> > +void dma_fence_preempt_init(struct dma_fence_preempt *fence,
> > +			    const struct dma_fence_preempt_ops *ops,
> > +			    struct workqueue_struct *wq,
> > +			    u64 context, u64 seqno)
> > +{
> > +	/*
> > +	 * XXX: We really want to check wq for WQ_MEM_RECLAIM here but
> > +	 * workqueue_struct is private.
> > +	 */
> > +
> > +	fence->ops = ops;
> > +	fence->wq = wq;
> > +	INIT_WORK(&fence->work, dma_fence_preempt_work_func);
> > +	spin_lock_init(&fence->lock);
> > +	dma_fence_init(&fence->base, &preempt_fence_ops,
> > +		       &fence->lock, context, seqno);
> > +}
> > +EXPORT_SYMBOL(dma_fence_preempt_init);
> > diff --git a/include/linux/dma-fence-preempt.h b/include/linux/dma-fence-preempt.h
> > new file mode 100644
> > index 000000000000..28d803f89527
> > --- /dev/null
> > +++ b/include/linux/dma-fence-preempt.h
> > @@ -0,0 +1,56 @@
> > +/* SPDX-License-Identifier: MIT */
> > +/*
> > + * Copyright © 2024 Intel Corporation
> > + */
> > +
> > +#ifndef __LINUX_DMA_FENCE_PREEMPT_H
> > +#define __LINUX_DMA_FENCE_PREEMPT_H
> > +
> > +#include <linux/dma-fence.h>
> > +#include <linux/workqueue.h>
> > +
> > +struct dma_fence_preempt;
> > +struct dma_resv;
> > +
> > +/**
> > + * struct dma_fence_preempt_ops - Preempt fence operations
> > + *
> > + * These functions should be implemented in the driver side.
> > + */
> > +struct dma_fence_preempt_ops {
> > +	/** @preempt_delay: Preempt execution with a delay */
> > +	struct dma_fence *(*preempt_delay)(struct dma_fence_preempt *fence);
> > +	/** @preempt: Preempt execution */
> > +	int (*preempt)(struct dma_fence_preempt *fence);
> > +	/** @preempt_wait: Wait for preempt of execution to complete */
> > +	int (*preempt_wait)(struct dma_fence_preempt *fence);
> > +	/** @preempt_finished: Signal that the preempt has finished */
> > +	void (*preempt_finished)(struct dma_fence_preempt *fence);
> > +};
> > +
> > +/**
> > + * struct dma_fence_preempt - Embedded preempt fence base class
> > + */
> > +struct dma_fence_preempt {
> > +	/** @base: Fence base class */
> > +	struct dma_fence base;
> > +	/** @lock: Spinlock for fence handling */
> > +	spinlock_t lock;
> > +	/** @cb: Callback preempt delay */
> > +	struct dma_fence_cb cb;
> > +	/** @ops: Preempt fence operation */
> > +	const struct dma_fence_preempt_ops *ops;
> > +	/** @wq: Work queue for preempt wait */
> > +	struct workqueue_struct *wq;
> > +	/** @work: Work struct for preempt wait */
> > +	struct work_struct work;
> > +};
> > +
> > +bool dma_fence_is_preempt(const struct dma_fence *fence);
> > +
> > +void dma_fence_preempt_init(struct dma_fence_preempt *fence,
> > +			    const struct dma_fence_preempt_ops *ops,
> > +			    struct workqueue_struct *wq,
> > +			    u64 context, u64 seqno);
> > +
> > +#endif
>

Christian König Nov. 21, 2024, 10:04 a.m. UTC | #3

Am 20.11.24 um 18:36 schrieb Matthew Brost:
> On Wed, Nov 20, 2024 at 02:31:50PM +0100, Christian König wrote:
>> Am 19.11.24 um 00:37 schrieb Matthew Brost:
>>> Add a dma_fence_preempt base class with driver ops to implement
>>> preemption, based on the existing Xe preemptive fence implementation.
>>>
>>> Annotated to ensure correct driver usage.
>>>
>>> Cc: Dave Airlie<airlied@redhat.com>
>>> Cc: Simona Vetter<simona.vetter@ffwll.ch>
>>> Cc: Christian Koenig<christian.koenig@amd.com>
>>> Signed-off-by: Matthew Brost<matthew.brost@intel.com>
>>> ---
>>>    drivers/dma-buf/Makefile            |   2 +-
>>>    drivers/dma-buf/dma-fence-preempt.c | 133 ++++++++++++++++++++++++++++
>>>    include/linux/dma-fence-preempt.h   |  56 ++++++++++++
>>>    3 files changed, 190 insertions(+), 1 deletion(-)
>>>    create mode 100644 drivers/dma-buf/dma-fence-preempt.c
>>>    create mode 100644 include/linux/dma-fence-preempt.h
>>>
>>> diff --git a/drivers/dma-buf/Makefile b/drivers/dma-buf/Makefile
>>> index 70ec901edf2c..c25500bb38b5 100644
>>> --- a/drivers/dma-buf/Makefile
>>> +++ b/drivers/dma-buf/Makefile
>>> @@ -1,6 +1,6 @@
>>>    # SPDX-License-Identifier: GPL-2.0-only
>>>    obj-y := dma-buf.o dma-fence.o dma-fence-array.o dma-fence-chain.o \
>>> -	 dma-fence-unwrap.o dma-resv.o
>>> +	 dma-fence-preempt.o dma-fence-unwrap.o dma-resv.o
>>>    obj-$(CONFIG_DMABUF_HEAPS)	+= dma-heap.o
>>>    obj-$(CONFIG_DMABUF_HEAPS)	+= heaps/
>>>    obj-$(CONFIG_SYNC_FILE)		+= sync_file.o
>>> diff --git a/drivers/dma-buf/dma-fence-preempt.c b/drivers/dma-buf/dma-fence-preempt.c
>>> new file mode 100644
>>> index 000000000000..6e6ce7ea7421
>>> --- /dev/null
>>> +++ b/drivers/dma-buf/dma-fence-preempt.c
>>> @@ -0,0 +1,133 @@
>>> +// SPDX-License-Identifier: MIT
>>> +/*
>>> + * Copyright © 2024 Intel Corporation
>>> + */
>>> +
>>> +#include <linux/dma-fence-preempt.h>
>>> +#include <linux/dma-resv.h>
>>> +
>>> +static void dma_fence_preempt_work_func(struct work_struct *w)
>>> +{
>>> +	bool cookie = dma_fence_begin_signalling();
>>> +	struct dma_fence_preempt *pfence =
>>> +		container_of(w, typeof(*pfence), work);
>>> +	const struct dma_fence_preempt_ops *ops = pfence->ops;
>>> +	int err = pfence->base.error;
>>> +
>>> +	if (!err) {
>>> +		err = ops->preempt_wait(pfence);
>>> +		if (err)
>>> +			dma_fence_set_error(&pfence->base, err);
>>> +	}
>>> +
>>> +	dma_fence_signal(&pfence->base);
>>> +	ops->preempt_finished(pfence);
>> Why is that callback useful?
>>
> In Xe, this is where we kick the resume worker and drop a ref to the
> preemption object, which in Xe is an individual queue, and in AMD it is
> a VM, right?

Correct. The whole VM is preempted since we don't know which queue is 
using which BO.

> wrt preemption object, I've reasoned this should work for
> an either per queue or VM driver design of preempt fences.
>
> This part likely could be moved into the preempt_wait callback though
> but would get a little goofy in the error case if preempt_wait is not
> called as the driver side would still need to cleanup a ref. Maybe I
> don't even need a ref though - have to think that through - but for
> general safety we typically like to take refs whenever a fence
> references a different object.

The tricky part is that at least for us we need to do this *before* the 
fence is signaled.

This way we can do something like:

retry:
     mutex_lock(&lock);
     if (dma_fence_is_signaled(preemept_fence)) {
         mutex_unlock(&lock);
         flush_work(resume_work);
         gotot retry;
     }


To make sure that we have a valid and working VM before we publish the 
user fence anywhere and preempting the VM will wait for this user fence 
to complete.

>
>>> +
>>> +	dma_fence_end_signalling(cookie);
>>> +}
>>> +
>>> +static const char *
>>> +dma_fence_preempt_get_driver_name(struct dma_fence *fence)
>>> +{
>>> +	return "dma_fence_preempt";
>>> +}
>>> +
>>> +static const char *
>>> +dma_fence_preempt_get_timeline_name(struct dma_fence *fence)
>>> +{
>>> +	return "ordered";
>>> +}
>>> +
>>> +static void dma_fence_preempt_issue(struct dma_fence_preempt *pfence)
>>> +{
>>> +	int err;
>>> +
>>> +	err = pfence->ops->preempt(pfence);
>>> +	if (err)
>>> +		dma_fence_set_error(&pfence->base, err);
>>> +
>>> +	queue_work(pfence->wq, &pfence->work);
>>> +}
>>> +
>>> +static void dma_fence_preempt_cb(struct dma_fence *fence,
>>> +				 struct dma_fence_cb *cb)
>>> +{
>>> +	struct dma_fence_preempt *pfence =
>>> +		container_of(cb, typeof(*pfence), cb);
>>> +
>>> +	dma_fence_preempt_issue(pfence);
>>> +}
>>> +
>>> +static void dma_fence_preempt_delay(struct dma_fence_preempt *pfence)
>>> +{
>>> +	struct dma_fence *fence;
>>> +	int err;
>>> +
>>> +	fence = pfence->ops->preempt_delay(pfence);
>> Mhm, why is that useful?
>>
> This for attaching the preempt object's last exported fence which needs
> to be signaled before the preemption is issued. So for purely long
> running VM's, this function could be NULL. For VM's with user queues +
> dma fences, the driver returns the last fence from the convert user
> fence to dma-fence IOCTL.
>
> I realized my kernel doc doesn't explain this as well as it should, I
> have already made this more verbose locally and hopefully it clearly
> explains all of this.

That part was actually obvious. But I would expected that to be push 
interface instead of a pull interface.

E.g. the preemption fence would also provide something like a manager 
object which has a mutex, the last exported user fence and the necessary 
functionality to update this user fence.

The tricky part is really to get this dance right between signaling the 
preemption fence and not allowing installing a new user fence before the 
resume worker has re-created the VM.

>>> +	if (WARN_ON_ONCE(!fence || IS_ERR(fence)))
>>> +		return;
>>> +
>>> +	err = dma_fence_add_callback(fence, &pfence->cb, dma_fence_preempt_cb);
>> You are running into the exactly same bug we had :)
>>
>> The problem here is that you can't call dma_fence_add_callback() from the
>> enable_signaling callback. Background is that the
>> fence_ops->enable_signaling callback is called with the spinlock of the
>> preemption fence held.
>>
>> This spinlock can be the same as the one of the user fence, but it could
>> also be a different one. Either way calling dma_fence_add_callback() would
>> let lockdep print a nice warning.
>>
> Hmm, I see the problem if you share a lock between the preempt fence and
> last exported fence but as long as these locks are seperate I don't see
> the issue.
>
> The locking order then is:
>
> preempt fence lock -> last exported fence lock.

You would need to annotate that as nested lock for lockdep and the 
dma_fence framework currently doesn't allow that.

> Lockdep does not explode in Xe but maybe can buy this is a little
> unsafe. We could always move preempt_delay to the worker, attach a CB,
> and rekick the worker upon the last fence signaling if you think that is
> safer. Of course we could always just directly wait on the returned last
> fence in the worker too.

Yeah I that is basically what we do at the moment since you also need to 
make sure that no new user fence is installed while you wait for the 
latest to signal.

Regards,
Christian.

>
>> I tried to solve this by changing the dma_fence code to not call
>> enable_signaling with the lock held, we wanted to do that anyway to prevent
>> a bunch of issues with driver unload. But I realized that getting this
>> upstream would take to long.
>>
>> Long story short we moved handling the user fence into the work item.
>>
> I did run into an issue when trying to make preempt_wait after return a
> fence + attach a CB, and signal this preempt fence from the CB. I got
> locking inversions almost worked through them but eventually gave up and
> stuck with the worker.
>
> Matt
>
>> Apart from that looks rather solid to me.
>>
>> Regards,
>> Christian.
>>
>>> +	if (err == -ENOENT)
>>> +		dma_fence_preempt_issue(pfence);
>>> +}
>>> +
>>> +static bool dma_fence_preempt_enable_signaling(struct dma_fence *fence)
>>> +{
>>> +	struct dma_fence_preempt *pfence =
>>> +		container_of(fence, typeof(*pfence), base);
>>> +
>>> +	if (pfence->ops->preempt_delay)
>>> +		dma_fence_preempt_delay(pfence);
>>> +	else
>>> +		dma_fence_preempt_issue(pfence);
>>> +
>>> +	return true;
>>> +}
>>> +
>>> +static const struct dma_fence_ops preempt_fence_ops = {
>>> +	.get_driver_name = dma_fence_preempt_get_driver_name,
>>> +	.get_timeline_name = dma_fence_preempt_get_timeline_name,
>>> +	.enable_signaling = dma_fence_preempt_enable_signaling,
>>> +};
>>> +
>>> +/**
>>> + * dma_fence_is_preempt() - Is preempt fence
>>> + *
>>> + * @fence: Preempt fence
>>> + *
>>> + * Return: True if preempt fence, False otherwise
>>> + */
>>> +bool dma_fence_is_preempt(const struct dma_fence *fence)
>>> +{
>>> +	return fence->ops == &preempt_fence_ops;
>>> +}
>>> +EXPORT_SYMBOL(dma_fence_is_preempt);
>>> +
>>> +/**
>>> + * dma_fence_preempt_init() - Initial preempt fence
>>> + *
>>> + * @fence: Preempt fence
>>> + * @ops: Preempt fence operations
>>> + * @wq: Work queue for preempt wait, should have WQ_MEM_RECLAIM set
>>> + * @context: Fence context
>>> + * @seqno: Fence seqence number
>>> + */
>>> +void dma_fence_preempt_init(struct dma_fence_preempt *fence,
>>> +			    const struct dma_fence_preempt_ops *ops,
>>> +			    struct workqueue_struct *wq,
>>> +			    u64 context, u64 seqno)
>>> +{
>>> +	/*
>>> +	 * XXX: We really want to check wq for WQ_MEM_RECLAIM here but
>>> +	 * workqueue_struct is private.
>>> +	 */
>>> +
>>> +	fence->ops = ops;
>>> +	fence->wq = wq;
>>> +	INIT_WORK(&fence->work, dma_fence_preempt_work_func);
>>> +	spin_lock_init(&fence->lock);
>>> +	dma_fence_init(&fence->base, &preempt_fence_ops,
>>> +		       &fence->lock, context, seqno);
>>> +}
>>> +EXPORT_SYMBOL(dma_fence_preempt_init);
>>> diff --git a/include/linux/dma-fence-preempt.h b/include/linux/dma-fence-preempt.h
>>> new file mode 100644
>>> index 000000000000..28d803f89527
>>> --- /dev/null
>>> +++ b/include/linux/dma-fence-preempt.h
>>> @@ -0,0 +1,56 @@
>>> +/* SPDX-License-Identifier: MIT */
>>> +/*
>>> + * Copyright © 2024 Intel Corporation
>>> + */
>>> +
>>> +#ifndef __LINUX_DMA_FENCE_PREEMPT_H
>>> +#define __LINUX_DMA_FENCE_PREEMPT_H
>>> +
>>> +#include <linux/dma-fence.h>
>>> +#include <linux/workqueue.h>
>>> +
>>> +struct dma_fence_preempt;
>>> +struct dma_resv;
>>> +
>>> +/**
>>> + * struct dma_fence_preempt_ops - Preempt fence operations
>>> + *
>>> + * These functions should be implemented in the driver side.
>>> + */
>>> +struct dma_fence_preempt_ops {
>>> +	/** @preempt_delay: Preempt execution with a delay */
>>> +	struct dma_fence *(*preempt_delay)(struct dma_fence_preempt *fence);
>>> +	/** @preempt: Preempt execution */
>>> +	int (*preempt)(struct dma_fence_preempt *fence);
>>> +	/** @preempt_wait: Wait for preempt of execution to complete */
>>> +	int (*preempt_wait)(struct dma_fence_preempt *fence);
>>> +	/** @preempt_finished: Signal that the preempt has finished */
>>> +	void (*preempt_finished)(struct dma_fence_preempt *fence);
>>> +};
>>> +
>>> +/**
>>> + * struct dma_fence_preempt - Embedded preempt fence base class
>>> + */
>>> +struct dma_fence_preempt {
>>> +	/** @base: Fence base class */
>>> +	struct dma_fence base;
>>> +	/** @lock: Spinlock for fence handling */
>>> +	spinlock_t lock;
>>> +	/** @cb: Callback preempt delay */
>>> +	struct dma_fence_cb cb;
>>> +	/** @ops: Preempt fence operation */
>>> +	const struct dma_fence_preempt_ops *ops;
>>> +	/** @wq: Work queue for preempt wait */
>>> +	struct workqueue_struct *wq;
>>> +	/** @work: Work struct for preempt wait */
>>> +	struct work_struct work;
>>> +};
>>> +
>>> +bool dma_fence_is_preempt(const struct dma_fence *fence);
>>> +
>>> +void dma_fence_preempt_init(struct dma_fence_preempt *fence,
>>> +			    const struct dma_fence_preempt_ops *ops,
>>> +			    struct workqueue_struct *wq,
>>> +			    u64 context, u64 seqno);
>>> +
>>> +#endif

Matthew Brost Nov. 21, 2024, 6:41 p.m. UTC | #4

On Thu, Nov 21, 2024 at 11:04:47AM +0100, Christian König wrote:
> Am 20.11.24 um 18:36 schrieb Matthew Brost:
> > On Wed, Nov 20, 2024 at 02:31:50PM +0100, Christian König wrote:
> > > Am 19.11.24 um 00:37 schrieb Matthew Brost:
> > > > Add a dma_fence_preempt base class with driver ops to implement
> > > > preemption, based on the existing Xe preemptive fence implementation.
> > > > 
> > > > Annotated to ensure correct driver usage.
> > > > 
> > > > Cc: Dave Airlie<airlied@redhat.com>
> > > > Cc: Simona Vetter<simona.vetter@ffwll.ch>
> > > > Cc: Christian Koenig<christian.koenig@amd.com>
> > > > Signed-off-by: Matthew Brost<matthew.brost@intel.com>
> > > > ---
> > > >    drivers/dma-buf/Makefile            |   2 +-
> > > >    drivers/dma-buf/dma-fence-preempt.c | 133 ++++++++++++++++++++++++++++
> > > >    include/linux/dma-fence-preempt.h   |  56 ++++++++++++
> > > >    3 files changed, 190 insertions(+), 1 deletion(-)
> > > >    create mode 100644 drivers/dma-buf/dma-fence-preempt.c
> > > >    create mode 100644 include/linux/dma-fence-preempt.h
> > > > 
> > > > diff --git a/drivers/dma-buf/Makefile b/drivers/dma-buf/Makefile
> > > > index 70ec901edf2c..c25500bb38b5 100644
> > > > --- a/drivers/dma-buf/Makefile
> > > > +++ b/drivers/dma-buf/Makefile
> > > > @@ -1,6 +1,6 @@
> > > >    # SPDX-License-Identifier: GPL-2.0-only
> > > >    obj-y := dma-buf.o dma-fence.o dma-fence-array.o dma-fence-chain.o \
> > > > -	 dma-fence-unwrap.o dma-resv.o
> > > > +	 dma-fence-preempt.o dma-fence-unwrap.o dma-resv.o
> > > >    obj-$(CONFIG_DMABUF_HEAPS)	+= dma-heap.o
> > > >    obj-$(CONFIG_DMABUF_HEAPS)	+= heaps/
> > > >    obj-$(CONFIG_SYNC_FILE)		+= sync_file.o
> > > > diff --git a/drivers/dma-buf/dma-fence-preempt.c b/drivers/dma-buf/dma-fence-preempt.c
> > > > new file mode 100644
> > > > index 000000000000..6e6ce7ea7421
> > > > --- /dev/null
> > > > +++ b/drivers/dma-buf/dma-fence-preempt.c
> > > > @@ -0,0 +1,133 @@
> > > > +// SPDX-License-Identifier: MIT
> > > > +/*
> > > > + * Copyright © 2024 Intel Corporation
> > > > + */
> > > > +
> > > > +#include <linux/dma-fence-preempt.h>
> > > > +#include <linux/dma-resv.h>
> > > > +
> > > > +static void dma_fence_preempt_work_func(struct work_struct *w)
> > > > +{
> > > > +	bool cookie = dma_fence_begin_signalling();
> > > > +	struct dma_fence_preempt *pfence =
> > > > +		container_of(w, typeof(*pfence), work);
> > > > +	const struct dma_fence_preempt_ops *ops = pfence->ops;
> > > > +	int err = pfence->base.error;
> > > > +
> > > > +	if (!err) {
> > > > +		err = ops->preempt_wait(pfence);
> > > > +		if (err)
> > > > +			dma_fence_set_error(&pfence->base, err);
> > > > +	}
> > > > +
> > > > +	dma_fence_signal(&pfence->base);
> > > > +	ops->preempt_finished(pfence);
> > > Why is that callback useful?
> > > 
> > In Xe, this is where we kick the resume worker and drop a ref to the
> > preemption object, which in Xe is an individual queue, and in AMD it is
> > a VM, right?
> 
> Correct. The whole VM is preempted since we don't know which queue is using
> which BO.
> 

Right. In Xe we don't know which queue is using a BO either - we trigger
all queues preempt fences attached to the VM effectively preempting the
entire VM. Per VM preempt fence or per queue preempt fence is a driver
choice (I can see arguments for both cases) is the point here and any
base class shouldn't dictate what a driver wants to do.

> > wrt preemption object, I've reasoned this should work for
> > an either per queue or VM driver design of preempt fences.
> > 
> > This part likely could be moved into the preempt_wait callback though
> > but would get a little goofy in the error case if preempt_wait is not
> > called as the driver side would still need to cleanup a ref. Maybe I
> > don't even need a ref though - have to think that through - but for
> > general safety we typically like to take refs whenever a fence
> > references a different object.
> 
> The tricky part is that at least for us we need to do this *before* the
> fence is signaled.

Hmm, I'm a little confused by this. Do you think the code as is missing
somehthing or opposed to keeping the preempt_finished vfunc?

> 
> This way we can do something like:
> 
> retry:
>     mutex_lock(&lock);
>     if (dma_fence_is_signaled(preemept_fence)) {
>         mutex_unlock(&lock);
>         flush_work(resume_work);
>         gotot retry;
>     }
>

This snippet is from your convert user fence to dma fence IOCTL? I think
this makes sense given your design of the conver user to dma fence IOCTL
not actually doing a resume - I landed making that IOCTL basically
another version of the resume worker for simplicity but that may change
if we find locking the entire VM is too costly.

> 
> To make sure that we have a valid and working VM before we publish the user
> fence anywhere and preempting the VM will wait for this user fence to
> complete.
> 

Agree the VM needs be in valid state befor publish a user fence as a dma
fence given a resume requires memory allocations thus breaking dma
fencing rules.

> > 
> > > > +
> > > > +	dma_fence_end_signalling(cookie);
> > > > +}
> > > > +
> > > > +static const char *
> > > > +dma_fence_preempt_get_driver_name(struct dma_fence *fence)
> > > > +{
> > > > +	return "dma_fence_preempt";
> > > > +}
> > > > +
> > > > +static const char *
> > > > +dma_fence_preempt_get_timeline_name(struct dma_fence *fence)
> > > > +{
> > > > +	return "ordered";
> > > > +}
> > > > +
> > > > +static void dma_fence_preempt_issue(struct dma_fence_preempt *pfence)
> > > > +{
> > > > +	int err;
> > > > +
> > > > +	err = pfence->ops->preempt(pfence);
> > > > +	if (err)
> > > > +		dma_fence_set_error(&pfence->base, err);
> > > > +
> > > > +	queue_work(pfence->wq, &pfence->work);
> > > > +}
> > > > +
> > > > +static void dma_fence_preempt_cb(struct dma_fence *fence,
> > > > +				 struct dma_fence_cb *cb)
> > > > +{
> > > > +	struct dma_fence_preempt *pfence =
> > > > +		container_of(cb, typeof(*pfence), cb);
> > > > +
> > > > +	dma_fence_preempt_issue(pfence);
> > > > +}
> > > > +
> > > > +static void dma_fence_preempt_delay(struct dma_fence_preempt *pfence)
> > > > +{
> > > > +	struct dma_fence *fence;
> > > > +	int err;
> > > > +
> > > > +	fence = pfence->ops->preempt_delay(pfence);
> > > Mhm, why is that useful?
> > > 
> > This for attaching the preempt object's last exported fence which needs
> > to be signaled before the preemption is issued. So for purely long
> > running VM's, this function could be NULL. For VM's with user queues +
> > dma fences, the driver returns the last fence from the convert user
> > fence to dma-fence IOCTL.
> > 
> > I realized my kernel doc doesn't explain this as well as it should, I
> > have already made this more verbose locally and hopefully it clearly
> > explains all of this.
> 
> That part was actually obvious. But I would expected that to be push
> interface instead of a pull interface.
> 
> E.g. the preemption fence would also provide something like a manager object
> which has a mutex, the last exported user fence and the necessary
> functionality to update this user fence.
> 

Hmm, I rather like the pull interface. In Xe this is dma-fence chain
attached to the VM. It safe pull given our covert IOCTL takes the VM's
dma-resv locks / notifier locks before publishing the user fence.

In your design couldn't you use spin lock in the last step of publishing
a user fence which checks for sw signaling on the preempt fence, if it
enabled restart the IOCTL waiting the resume worker? Then in this vfunc
pull the fence under the spin lock?

Not opposed to a push interface though if you really think this the way
to go. Quite certain I could make that work for Xe too.

> The tricky part is really to get this dance right between signaling the
> preemption fence and not allowing installing a new user fence before the
> resume worker has re-created the VM.
> 

Yes, indeed this is tricky.

> > > > +	if (WARN_ON_ONCE(!fence || IS_ERR(fence)))
> > > > +		return;
> > > > +
> > > > +	err = dma_fence_add_callback(fence, &pfence->cb, dma_fence_preempt_cb);
> > > You are running into the exactly same bug we had :)
> > > 
> > > The problem here is that you can't call dma_fence_add_callback() from the
> > > enable_signaling callback. Background is that the
> > > fence_ops->enable_signaling callback is called with the spinlock of the
> > > preemption fence held.
> > > 
> > > This spinlock can be the same as the one of the user fence, but it could
> > > also be a different one. Either way calling dma_fence_add_callback() would
> > > let lockdep print a nice warning.
> > > 
> > Hmm, I see the problem if you share a lock between the preempt fence and
> > last exported fence but as long as these locks are seperate I don't see
> > the issue.
> > 
> > The locking order then is:
> > 
> > preempt fence lock -> last exported fence lock.
> 
> You would need to annotate that as nested lock for lockdep and the dma_fence
> framework currently doesn't allow that.
> 

This definitely works as is - I've tested this. If dma-fence's lock was
embedded within the dma-fence, then ofc lockdep would complain without
nesting. It isn't though - the spin lock is passed in as argument so
lockdep can identify the locks for 'preempt fence lock' and 'last
exported fence' as independent locks.

> > Lockdep does not explode in Xe but maybe can buy this is a little
> > unsafe. We could always move preempt_delay to the worker, attach a CB,
> > and rekick the worker upon the last fence signaling if you think that is
> > safer. Of course we could always just directly wait on the returned last
> > fence in the worker too.
> 
> Yeah I that is basically what we do at the moment since you also need to
> make sure that no new user fence is installed while you wait for the latest
> to signal.
> 

After I typed this realized waiting on 'last fence' in the worker is a no
go given we want to pipeline preemptions in Xe (e.g. issue all queues
preemption commands to firmware in parallel as these are async
operations which may be fast in cases and slow in others). I think
having preempt vfunc done directly in a dma-fence CB is a must.

Matt

> Regards,
> Christian.
> 
> > 
> > > I tried to solve this by changing the dma_fence code to not call
> > > enable_signaling with the lock held, we wanted to do that anyway to prevent
> > > a bunch of issues with driver unload. But I realized that getting this
> > > upstream would take to long.
> > > 
> > > Long story short we moved handling the user fence into the work item.
> > > 
> > I did run into an issue when trying to make preempt_wait after return a
> > fence + attach a CB, and signal this preempt fence from the CB. I got
> > locking inversions almost worked through them but eventually gave up and
> > stuck with the worker.
> > 
> > Matt
> > 
> > > Apart from that looks rather solid to me.
> > > 
> > > Regards,
> > > Christian.
> > > 
> > > > +	if (err == -ENOENT)
> > > > +		dma_fence_preempt_issue(pfence);
> > > > +}
> > > > +
> > > > +static bool dma_fence_preempt_enable_signaling(struct dma_fence *fence)
> > > > +{
> > > > +	struct dma_fence_preempt *pfence =
> > > > +		container_of(fence, typeof(*pfence), base);
> > > > +
> > > > +	if (pfence->ops->preempt_delay)
> > > > +		dma_fence_preempt_delay(pfence);
> > > > +	else
> > > > +		dma_fence_preempt_issue(pfence);
> > > > +
> > > > +	return true;
> > > > +}
> > > > +
> > > > +static const struct dma_fence_ops preempt_fence_ops = {
> > > > +	.get_driver_name = dma_fence_preempt_get_driver_name,
> > > > +	.get_timeline_name = dma_fence_preempt_get_timeline_name,
> > > > +	.enable_signaling = dma_fence_preempt_enable_signaling,
> > > > +};
> > > > +
> > > > +/**
> > > > + * dma_fence_is_preempt() - Is preempt fence
> > > > + *
> > > > + * @fence: Preempt fence
> > > > + *
> > > > + * Return: True if preempt fence, False otherwise
> > > > + */
> > > > +bool dma_fence_is_preempt(const struct dma_fence *fence)
> > > > +{
> > > > +	return fence->ops == &preempt_fence_ops;
> > > > +}
> > > > +EXPORT_SYMBOL(dma_fence_is_preempt);
> > > > +
> > > > +/**
> > > > + * dma_fence_preempt_init() - Initial preempt fence
> > > > + *
> > > > + * @fence: Preempt fence
> > > > + * @ops: Preempt fence operations
> > > > + * @wq: Work queue for preempt wait, should have WQ_MEM_RECLAIM set
> > > > + * @context: Fence context
> > > > + * @seqno: Fence seqence number
> > > > + */
> > > > +void dma_fence_preempt_init(struct dma_fence_preempt *fence,
> > > > +			    const struct dma_fence_preempt_ops *ops,
> > > > +			    struct workqueue_struct *wq,
> > > > +			    u64 context, u64 seqno)
> > > > +{
> > > > +	/*
> > > > +	 * XXX: We really want to check wq for WQ_MEM_RECLAIM here but
> > > > +	 * workqueue_struct is private.
> > > > +	 */
> > > > +
> > > > +	fence->ops = ops;
> > > > +	fence->wq = wq;
> > > > +	INIT_WORK(&fence->work, dma_fence_preempt_work_func);
> > > > +	spin_lock_init(&fence->lock);
> > > > +	dma_fence_init(&fence->base, &preempt_fence_ops,
> > > > +		       &fence->lock, context, seqno);
> > > > +}
> > > > +EXPORT_SYMBOL(dma_fence_preempt_init);
> > > > diff --git a/include/linux/dma-fence-preempt.h b/include/linux/dma-fence-preempt.h
> > > > new file mode 100644
> > > > index 000000000000..28d803f89527
> > > > --- /dev/null
> > > > +++ b/include/linux/dma-fence-preempt.h
> > > > @@ -0,0 +1,56 @@
> > > > +/* SPDX-License-Identifier: MIT */
> > > > +/*
> > > > + * Copyright © 2024 Intel Corporation
> > > > + */
> > > > +
> > > > +#ifndef __LINUX_DMA_FENCE_PREEMPT_H
> > > > +#define __LINUX_DMA_FENCE_PREEMPT_H
> > > > +
> > > > +#include <linux/dma-fence.h>
> > > > +#include <linux/workqueue.h>
> > > > +
> > > > +struct dma_fence_preempt;
> > > > +struct dma_resv;
> > > > +
> > > > +/**
> > > > + * struct dma_fence_preempt_ops - Preempt fence operations
> > > > + *
> > > > + * These functions should be implemented in the driver side.
> > > > + */
> > > > +struct dma_fence_preempt_ops {
> > > > +	/** @preempt_delay: Preempt execution with a delay */
> > > > +	struct dma_fence *(*preempt_delay)(struct dma_fence_preempt *fence);
> > > > +	/** @preempt: Preempt execution */
> > > > +	int (*preempt)(struct dma_fence_preempt *fence);
> > > > +	/** @preempt_wait: Wait for preempt of execution to complete */
> > > > +	int (*preempt_wait)(struct dma_fence_preempt *fence);
> > > > +	/** @preempt_finished: Signal that the preempt has finished */
> > > > +	void (*preempt_finished)(struct dma_fence_preempt *fence);
> > > > +};
> > > > +
> > > > +/**
> > > > + * struct dma_fence_preempt - Embedded preempt fence base class
> > > > + */
> > > > +struct dma_fence_preempt {
> > > > +	/** @base: Fence base class */
> > > > +	struct dma_fence base;
> > > > +	/** @lock: Spinlock for fence handling */
> > > > +	spinlock_t lock;
> > > > +	/** @cb: Callback preempt delay */
> > > > +	struct dma_fence_cb cb;
> > > > +	/** @ops: Preempt fence operation */
> > > > +	const struct dma_fence_preempt_ops *ops;
> > > > +	/** @wq: Work queue for preempt wait */
> > > > +	struct workqueue_struct *wq;
> > > > +	/** @work: Work struct for preempt wait */
> > > > +	struct work_struct work;
> > > > +};
> > > > +
> > > > +bool dma_fence_is_preempt(const struct dma_fence *fence);
> > > > +
> > > > +void dma_fence_preempt_init(struct dma_fence_preempt *fence,
> > > > +			    const struct dma_fence_preempt_ops *ops,
> > > > +			    struct workqueue_struct *wq,
> > > > +			    u64 context, u64 seqno);
> > > > +
> > > > +#endif

Christian König Nov. 22, 2024, 10:56 a.m. UTC | #5

Am 21.11.24 um 19:41 schrieb Matthew Brost:
> [SNIP]
>>>>> +	ops->preempt_finished(pfence);
>>>> Why is that callback useful?
>>>>
>>> In Xe, this is where we kick the resume worker and drop a ref to the
>>> preemption object, which in Xe is an individual queue, and in AMD it is
>>> a VM, right?
>> Correct. The whole VM is preempted since we don't know which queue is using
>> which BO.
>>
> Right. In Xe we don't know which queue is using a BO either - we trigger
> all queues preempt fences attached to the VM effectively preempting the
> entire VM. Per VM preempt fence or per queue preempt fence is a driver
> choice (I can see arguments for both cases) is the point here and any
> base class shouldn't dictate what a driver wants to do.

Oh, I think we need to unify that for drivers or at least find an 
interface which works for both cases.

And yeah, I agree there are really good arguments for both directions.

Essentially you want separate preemption fences for each queue, but only 
one restore worker.

>>> wrt preemption object, I've reasoned this should work for
>>> an either per queue or VM driver design of preempt fences.
>>>
>>> This part likely could be moved into the preempt_wait callback though
>>> but would get a little goofy in the error case if preempt_wait is not
>>> called as the driver side would still need to cleanup a ref. Maybe I
>>> don't even need a ref though - have to think that through - but for
>>> general safety we typically like to take refs whenever a fence
>>> references a different object.
>> The tricky part is that at least for us we need to do this *before* the
>> fence is signaled.
> Hmm, I'm a little confused by this. Do you think the code as is missing
> somehthing or opposed to keeping the preempt_finished vfunc?

I think we first need a complete picture how all that is supposed to work.

When we say we resume only on demand then this callback would make sense 
I think, but at least the AMD solution doesn't do that at the moment.

>> This way we can do something like:
>>
>> retry:
>>      mutex_lock(&lock);
>>      if (dma_fence_is_signaled(preemept_fence)) {
>>          mutex_unlock(&lock);
>>          flush_work(resume_work);
>>          gotot retry;
>>      }
>>
> This snippet is from your convert user fence to dma fence IOCTL?

Yes.

> I think
> this makes sense given your design of the conver user to dma fence IOCTL
> not actually doing a resume - I landed making that IOCTL basically
> another version of the resume worker for simplicity but that may change
> if we find locking the entire VM is too costly.

Well we could also resume on demand (thinking more about it that's most 
likely the better approach) but I would implement it something like this 
then:

retry:
     mutex_lock(&lock);
     if (dma_fence_is_signaled(preemept_fence)) {
         mutex_unlock(&lock);
	schedule_work(resume_work);
         flush_work(resume_work);
         gotot retry;
     }


This way we don't run into issues when multiple participants try to 
resume at the same time. E.g. multiple threads where each one tries to 
submit work to different queues at the same time.

[SNIP]
>>>>> +	fence = pfence->ops->preempt_delay(pfence);
>>>> Mhm, why is that useful?
>>>>
>>> This for attaching the preempt object's last exported fence which needs
>>> to be signaled before the preemption is issued. So for purely long
>>> running VM's, this function could be NULL. For VM's with user queues +
>>> dma fences, the driver returns the last fence from the convert user
>>> fence to dma-fence IOCTL.
>>>
>>> I realized my kernel doc doesn't explain this as well as it should, I
>>> have already made this more verbose locally and hopefully it clearly
>>> explains all of this.
>> That part was actually obvious. But I would expected that to be push
>> interface instead of a pull interface.
>>
>> E.g. the preemption fence would also provide something like a manager object
>> which has a mutex, the last exported user fence and the necessary
>> functionality to update this user fence.
>>
> Hmm, I rather like the pull interface. In Xe this is dma-fence chain
> attached to the VM. It safe pull given our covert IOCTL takes the VM's
> dma-resv locks / notifier locks before publishing the user fence.

I don't think that will work like this.

Publishing the user fence must be serialized with signaling the 
preemption fence. And that serialization can't be done by the dma-resv 
lock nor the notifier lock because we can't let a dma_fence signaling 
depend on them.

That this is a separate lock or similar mechanism is a must have.

> In your design couldn't you use spin lock in the last step of publishing
> a user fence which checks for sw signaling on the preempt fence, if it
> enabled restart the IOCTL waiting the resume worker? Then in this vfunc
> pull the fence under the spin lock?

Yeah that could maybe work, but there is also a different challenge to 
keep in mind. See below.

> Not opposed to a push interface though if you really think this the way
> to go. Quite certain I could make that work for Xe too.

A push interface is just easier to validate. Keep in mind that you can 
only update the user fence when you can guarantee that the preemption 
fence is not signaled nor in the process of signaling.

So when you create a user fence the approach becomes:

1. kmalloc the fence structure,
2. Initialize the fence.
3. Push it into the premption manage of your queue, this makes sure that 
the queue is runable.
4. Publish the new fence in drm_sync, dma_resv etc...

> This definitely works as is - I've tested this. If dma-fence's lock was
> embedded within the dma-fence, then ofc lockdep would complain without
> nesting. It isn't though - the spin lock is passed in as argument so
> lockdep can identify the locks for 'preempt fence lock' and 'last
> exported fence' as independent locks.

Ah, good point.

But exactly that passing in of the lock is what we try to get away from 
to allow dma_fences to surpass the module they issued.

>>> Lockdep does not explode in Xe but maybe can buy this is a little
>>> unsafe. We could always move preempt_delay to the worker, attach a CB,
>>> and rekick the worker upon the last fence signaling if you think that is
>>> safer. Of course we could always just directly wait on the returned last
>>> fence in the worker too.
>> Yeah I that is basically what we do at the moment since you also need to
>> make sure that no new user fence is installed while you wait for the latest
>> to signal.
>>
> After I typed this realized waiting on 'last fence' in the worker is a no
> go given we want to pipeline preemptions in Xe (e.g. issue all queues
> preemption commands to firmware in parallel as these are async
> operations which may be fast in cases and slow in others). I think
> having preempt vfunc done directly in a dma-fence CB is a must.

At least with out current design that won't work because you need to 
somehow prevent installing new user fences while the preemption fence is 
signaling.

Currently we do that by holding a mutex, but you can't hold a mutex and 
return from a worker and then drop the mutex again in a different worker 
(ok in theory you can, but that is so strongly disregarded that upstream 
would probably reject the code).

Regards,
Christian.

>
> Matt
>

[RFC,01/29] dma-fence: Add dma_fence_preempt base class

Commit Message

Comments

Patch