diff mbox series

drm/ttm: Schedule delayed_delete worker closer

Message ID 20231107194554.945018-1-rajneesh.bhardwaj@amd.com (mailing list archive)
State New, archived
Headers show
Series drm/ttm: Schedule delayed_delete worker closer | expand

Commit Message

Rajneesh Bhardwaj Nov. 7, 2023, 7:45 p.m. UTC
When a TTM BO is getting freed, to optimize the clearing operation on
the workqueue, schedule it closer to a NUMA node where the memory was
allocated. This avoids the cases where the ttm_bo_delayed_delete gets
scheduled on the CPU cores that are across interconnect boundaries such
as xGMI, PCIe etc.

This change helps USWC GTT allocations on NUMA systems (dGPU) and AMD
APU platforms such as GFXIP9.4.3.

Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c     | 10 +++++++++-
 drivers/gpu/drm/ttm/ttm_device.c |  3 ++-
 2 files changed, 11 insertions(+), 2 deletions(-)

Comments

Felix Kuehling Nov. 7, 2023, 10:07 p.m. UTC | #1
On 2023-11-07 14:45, Rajneesh Bhardwaj wrote:
> When a TTM BO is getting freed, to optimize the clearing operation on
> the workqueue, schedule it closer to a NUMA node where the memory was
> allocated. This avoids the cases where the ttm_bo_delayed_delete gets
> scheduled on the CPU cores that are across interconnect boundaries such
> as xGMI, PCIe etc.
>
> This change helps USWC GTT allocations on NUMA systems (dGPU) and AMD
> APU platforms such as GFXIP9.4.3.
>
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>

Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>


> ---
>   drivers/gpu/drm/ttm/ttm_bo.c     | 10 +++++++++-
>   drivers/gpu/drm/ttm/ttm_device.c |  3 ++-
>   2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index 5757b9415e37..0d608441a112 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -370,7 +370,15 @@ static void ttm_bo_release(struct kref *kref)
>   			spin_unlock(&bo->bdev->lru_lock);
>   
>   			INIT_WORK(&bo->delayed_delete, ttm_bo_delayed_delete);
> -			queue_work(bdev->wq, &bo->delayed_delete);
> +			/* Schedule the worker on the closest NUMA node, if no
> +			 * CPUs are available, this falls back to any CPU core
> +			 * available system wide. This helps avoid the
> +			 * bottleneck to clear memory in cases where the worker
> +			 * is scheduled on a CPU which is remote to the node
> +			 * where the memory is getting freed.
> +			 */
> +
> +			queue_work_node(bdev->pool.nid, bdev->wq, &bo->delayed_delete);
>   			return;
>   		}
>   
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index 43e27ab77f95..72b81a2ee6c7 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -213,7 +213,8 @@ int ttm_device_init(struct ttm_device *bdev, struct ttm_device_funcs *funcs,
>   	bdev->funcs = funcs;
>   
>   	ttm_sys_man_init(bdev);
> -	ttm_pool_init(&bdev->pool, dev, NUMA_NO_NODE, use_dma_alloc, use_dma32);
> +
> +	ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, use_dma32);
>   
>   	bdev->vma_manager = vma_manager;
>   	spin_lock_init(&bdev->lru_lock);
Christian König Nov. 8, 2023, 8:36 a.m. UTC | #2
Am 07.11.23 um 20:45 schrieb Rajneesh Bhardwaj:
> When a TTM BO is getting freed, to optimize the clearing operation on
> the workqueue, schedule it closer to a NUMA node where the memory was
> allocated. This avoids the cases where the ttm_bo_delayed_delete gets
> scheduled on the CPU cores that are across interconnect boundaries such
> as xGMI, PCIe etc.

This needs more background and doesn't mention that we now try to 
allocate the memory close to the device.

Something like this here should work:

Try to allocate system memory on the NUMA node the device is closest to 
and try to run delayed delete workers on a CPU of this node as well.

The background of running the delayed delete worker on a NUMA node close 
to the one of the initial allocation is that the memory might be cleared 
on free by the core memory management and that should probably be done 
on a CPU close to it.

>
> This change helps USWC GTT allocations on NUMA systems (dGPU) and AMD
> APU platforms such as GFXIP9.4.3.
>
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
> ---
>   drivers/gpu/drm/ttm/ttm_bo.c     | 10 +++++++++-
>   drivers/gpu/drm/ttm/ttm_device.c |  3 ++-
>   2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index 5757b9415e37..0d608441a112 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -370,7 +370,15 @@ static void ttm_bo_release(struct kref *kref)
>   			spin_unlock(&bo->bdev->lru_lock);
>   
>   			INIT_WORK(&bo->delayed_delete, ttm_bo_delayed_delete);
> -			queue_work(bdev->wq, &bo->delayed_delete);
> +			/* Schedule the worker on the closest NUMA node,

>   if no
> +			 * CPUs are available, this falls back to any CPU core
> +			 * available system wide.

Mentioning that is superfluous since everybody can look at the 
implementation and that a fallback is available for a function which 
doesn't return an error is obvious.

>   This helps avoid the
> +			 * bottleneck to clear memory in cases where the worker
> +			 * is scheduled on a CPU which is remote to the node
> +			 * where the memory is getting freed.
> +			 */

Rather write something like "This improves performance since system 
memory might be cleared on free and that is best done on a CPU core 
close to it."

Regards,
Christian.

> +
> +			queue_work_node(bdev->pool.nid, bdev->wq, &bo->delayed_delete);
>   			return;
>   		}
>   
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index 43e27ab77f95..72b81a2ee6c7 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -213,7 +213,8 @@ int ttm_device_init(struct ttm_device *bdev, struct ttm_device_funcs *funcs,
>   	bdev->funcs = funcs;
>   
>   	ttm_sys_man_init(bdev);
> -	ttm_pool_init(&bdev->pool, dev, NUMA_NO_NODE, use_dma_alloc, use_dma32);
> +
> +	ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, use_dma32);
>   
>   	bdev->vma_manager = vma_manager;
>   	spin_lock_init(&bdev->lru_lock);
diff mbox series

Patch

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 5757b9415e37..0d608441a112 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -370,7 +370,15 @@  static void ttm_bo_release(struct kref *kref)
 			spin_unlock(&bo->bdev->lru_lock);
 
 			INIT_WORK(&bo->delayed_delete, ttm_bo_delayed_delete);
-			queue_work(bdev->wq, &bo->delayed_delete);
+			/* Schedule the worker on the closest NUMA node, if no
+			 * CPUs are available, this falls back to any CPU core
+			 * available system wide. This helps avoid the
+			 * bottleneck to clear memory in cases where the worker
+			 * is scheduled on a CPU which is remote to the node
+			 * where the memory is getting freed.
+			 */
+
+			queue_work_node(bdev->pool.nid, bdev->wq, &bo->delayed_delete);
 			return;
 		}
 
diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index 43e27ab77f95..72b81a2ee6c7 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -213,7 +213,8 @@  int ttm_device_init(struct ttm_device *bdev, struct ttm_device_funcs *funcs,
 	bdev->funcs = funcs;
 
 	ttm_sys_man_init(bdev);
-	ttm_pool_init(&bdev->pool, dev, NUMA_NO_NODE, use_dma_alloc, use_dma32);
+
+	ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, use_dma32);
 
 	bdev->vma_manager = vma_manager;
 	spin_lock_init(&bdev->lru_lock);