Message ID | 20231111130856.1168304-1-rajneesh.bhardwaj@amd.com (mailing list archive)
---|---
State | New, archived |
Series | [v3] drm/ttm: Schedule delayed_delete worker closer

On 11.11.23 at 14:08, Rajneesh Bhardwaj wrote:
> Try to allocate system memory on the NUMA node the device is closest to
> and try to run delayed_delete workers on a CPU of this node as well.
>
> To optimize the memory clearing operation when a TTM BO gets freed by
> the delayed_delete worker, scheduling it closer to a NUMA node where the
> memory was initially allocated helps avoid the cases where the worker
> gets randomly scheduled on the CPU cores that are across interconnect
> boundaries such as xGMI, PCIe etc.
>
> This change helps USWC GTT allocations on NUMA systems (dGPU) and AMD
> APU platforms such as GFXIP9.4.3.
>
> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
> Changes in v3:
>  * Use WQ_UNBOUND to address the warning reported by CI pipeline.
>
>  drivers/gpu/drm/ttm/ttm_bo.c     | 8 +++++++-
>  drivers/gpu/drm/ttm/ttm_device.c | 6 ++++--
>  2 files changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index 5757b9415e37..6f28a77a565b 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -370,7 +370,13 @@ static void ttm_bo_release(struct kref *kref)
>          spin_unlock(&bo->bdev->lru_lock);
>
>          INIT_WORK(&bo->delayed_delete, ttm_bo_delayed_delete);
> -        queue_work(bdev->wq, &bo->delayed_delete);
> +
> +        /* Schedule the worker on the closest NUMA node. This
> +         * improves performance since system memory might be
> +         * cleared on free and that is best done on a CPU core
> +         * close to it.
> +         */
> +        queue_work_node(bdev->pool.nid, bdev->wq, &bo->delayed_delete);
>          return;
>      }
>
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index 43e27ab77f95..bc97e3dd40f0 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -204,7 +204,8 @@ int ttm_device_init(struct ttm_device *bdev, struct ttm_device_funcs *funcs,
>      if (ret)
>          return ret;
>
> -    bdev->wq = alloc_workqueue("ttm", WQ_MEM_RECLAIM | WQ_HIGHPRI, 16);
> +    bdev->wq = alloc_workqueue("ttm",
> +                   WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 16);
>      if (!bdev->wq) {
>          ttm_global_release();
>          return -ENOMEM;
> @@ -213,7 +214,8 @@ int ttm_device_init(struct ttm_device *bdev, struct ttm_device_funcs *funcs,
>      bdev->funcs = funcs;
>
>      ttm_sys_man_init(bdev);
> -    ttm_pool_init(&bdev->pool, dev, NUMA_NO_NODE, use_dma_alloc, use_dma32);
> +
> +    ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, use_dma32);
>
>      bdev->vma_manager = vma_manager;
>      spin_lock_init(&bdev->lru_lock);
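For context on the mechanism, here is a minimal sketch of the scheduling half of the change. It is illustrative only, not TTM code: the numa_demo_* names and the "numa_demo" workqueue name are hypothetical, while dev_to_node(), queue_work_node(), alloc_workqueue(), INIT_WORK() and the WQ_* flags are the kernel interfaces the patch itself relies on.

```c
/* Illustrative sketch, not part of the patch: remember the device's NUMA
 * node at init time and hand deferred-free work to a CPU on that node.
 */
#include <linux/device.h>
#include <linux/errno.h>
#include <linux/numa.h>
#include <linux/workqueue.h>

struct numa_demo_dev {
	struct device *dev;
	struct workqueue_struct *wq;
	int nid;			/* NUMA node the device is attached to */
	struct work_struct free_work;
};

static void numa_demo_free_worker(struct work_struct *work)
{
	/* Free (and possibly clear) the buffer's system pages here. */
}

static int numa_demo_init(struct numa_demo_dev *ddev, struct device *dev)
{
	ddev->dev = dev;
	/* May be NUMA_NO_NODE on non-NUMA systems; queue_work_node() copes. */
	ddev->nid = dev_to_node(dev);

	/* queue_work_node() expects an unbound workqueue, hence WQ_UNBOUND. */
	ddev->wq = alloc_workqueue("numa_demo",
				   WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 16);
	return ddev->wq ? 0 : -ENOMEM;
}

static void numa_demo_queue_free(struct numa_demo_dev *ddev)
{
	INIT_WORK(&ddev->free_work, numa_demo_free_worker);
	/* Prefer a CPU on the device-local node for the deferred free. */
	queue_work_node(ddev->nid, ddev->wq, &ddev->free_work);
}
```

queue_work_node() only honours the node hint for unbound workqueues and warns otherwise, which is why v3 of the patch adds WQ_UNBOUND to the "ttm" workqueue.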
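The allocation half, passing dev_to_node(dev) into ttm_pool_init() so pool pages come from the device-local node, ultimately comes down to which page allocator gets asked. A hedged sketch under the same caveat: alloc_local_page() is a made-up helper, while alloc_pages_node(), alloc_pages() and NUMA_NO_NODE are real kernel APIs.

```c
/* Illustrative only, not the TTM pool implementation: with a valid node id,
 * request pages from the device-local node; NUMA_NO_NODE keeps the default
 * allocator behaviour.
 */
#include <linux/gfp.h>
#include <linux/numa.h>

static struct page *alloc_local_page(int nid, gfp_t gfp, unsigned int order)
{
	if (nid == NUMA_NO_NODE)
		return alloc_pages(gfp, order);

	return alloc_pages_node(nid, gfp, order);
}
```

Without __GFP_THISNODE the node id is only a preference, so the allocation can still fall back to other nodes under memory pressure: locality when possible, correctness regardless.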