Message ID | 20231107194554.945018-1-rajneesh.bhardwaj@amd.com (mailing list archive) |
---|---|
State | New, archived |
Series | drm/ttm: Schedule delayed_delete worker closer |
On 2023-11-07 14:45, Rajneesh Bhardwaj wrote:
> When a TTM BO is getting freed, to optimize the clearing operation on
> the workqueue, schedule it closer to a NUMA node where the memory was
> allocated. This avoids the cases where the ttm_bo_delayed_delete gets
> scheduled on the CPU cores that are across interconnect boundaries such
> as xGMI, PCIe etc.
>
> This change helps USWC GTT allocations on NUMA systems (dGPU) and AMD
> APU platforms such as GFXIP9.4.3.
>
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>

Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>

> ---
>  drivers/gpu/drm/ttm/ttm_bo.c     | 10 +++++++++-
>  drivers/gpu/drm/ttm/ttm_device.c |  3 ++-
>  2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index 5757b9415e37..0d608441a112 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -370,7 +370,15 @@ static void ttm_bo_release(struct kref *kref)
>  			spin_unlock(&bo->bdev->lru_lock);
>
>  			INIT_WORK(&bo->delayed_delete, ttm_bo_delayed_delete);
> -			queue_work(bdev->wq, &bo->delayed_delete);
> +			/* Schedule the worker on the closest NUMA node, if no
> +			 * CPUs are available, this falls back to any CPU core
> +			 * available system wide. This helps avoid the
> +			 * bottleneck to clear memory in cases where the worker
> +			 * is scheduled on a CPU which is remote to the node
> +			 * where the memory is getting freed.
> +			 */
> +
> +			queue_work_node(bdev->pool.nid, bdev->wq, &bo->delayed_delete);
>  			return;
>  		}
>
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index 43e27ab77f95..72b81a2ee6c7 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -213,7 +213,8 @@ int ttm_device_init(struct ttm_device *bdev, struct ttm_device_funcs *funcs,
>  	bdev->funcs = funcs;
>
>  	ttm_sys_man_init(bdev);
> -	ttm_pool_init(&bdev->pool, dev, NUMA_NO_NODE, use_dma_alloc, use_dma32);
> +
> +	ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, use_dma32);
>
>  	bdev->vma_manager = vma_manager;
>  	spin_lock_init(&bdev->lru_lock);
On 07.11.23 at 20:45, Rajneesh Bhardwaj wrote:
> When a TTM BO is getting freed, to optimize the clearing operation on
> the workqueue, schedule it closer to a NUMA node where the memory was
> allocated. This avoids the cases where the ttm_bo_delayed_delete gets
> scheduled on the CPU cores that are across interconnect boundaries such
> as xGMI, PCIe etc.

This needs more background and doesn't mention that we now try to
allocate the memory close to the device.

Something like this here should work:

Try to allocate system memory on the NUMA node the device is closest to
and try to run delayed delete workers on a CPU of this node as well.

The background of running the delayed delete worker on a NUMA node close
to the one of the initial allocation is that the memory might be cleared
on free by the core memory management, and that should probably be done
on a CPU close to it.

>
> This change helps USWC GTT allocations on NUMA systems (dGPU) and AMD
> APU platforms such as GFXIP9.4.3.
>
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
> ---
>  drivers/gpu/drm/ttm/ttm_bo.c     | 10 +++++++++-
>  drivers/gpu/drm/ttm/ttm_device.c |  3 ++-
>  2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index 5757b9415e37..0d608441a112 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -370,7 +370,15 @@ static void ttm_bo_release(struct kref *kref)
>  			spin_unlock(&bo->bdev->lru_lock);
>
>  			INIT_WORK(&bo->delayed_delete, ttm_bo_delayed_delete);
> -			queue_work(bdev->wq, &bo->delayed_delete);
> +			/* Schedule the worker on the closest NUMA node, if no
> +			 * CPUs are available, this falls back to any CPU core
> +			 * available system wide.

Mentioning that is superfluous: everybody can look at the
implementation, and it is obvious that a function which doesn't return
an error has a fallback.

> This helps avoid the
> +			 * bottleneck to clear memory in cases where the worker
> +			 * is scheduled on a CPU which is remote to the node
> +			 * where the memory is getting freed.
> +			 */

Rather write something like "This improves performance since system
memory might be cleared on free and that is best done on a CPU core
close to it."

Regards,
Christian.

> +
> +			queue_work_node(bdev->pool.nid, bdev->wq, &bo->delayed_delete);
>  			return;
>  		}
>
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index 43e27ab77f95..72b81a2ee6c7 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -213,7 +213,8 @@ int ttm_device_init(struct ttm_device *bdev, struct ttm_device_funcs *funcs,
>  	bdev->funcs = funcs;
>
>  	ttm_sys_man_init(bdev);
> -	ttm_pool_init(&bdev->pool, dev, NUMA_NO_NODE, use_dma_alloc, use_dma32);
> +
> +	ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, use_dma32);
>
>  	bdev->vma_manager = vma_manager;
>  	spin_lock_init(&bdev->lru_lock);
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 5757b9415e37..0d608441a112 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -370,7 +370,15 @@ static void ttm_bo_release(struct kref *kref)
 			spin_unlock(&bo->bdev->lru_lock);
 
 			INIT_WORK(&bo->delayed_delete, ttm_bo_delayed_delete);
-			queue_work(bdev->wq, &bo->delayed_delete);
+			/* Schedule the worker on the closest NUMA node, if no
+			 * CPUs are available, this falls back to any CPU core
+			 * available system wide. This helps avoid the
+			 * bottleneck to clear memory in cases where the worker
+			 * is scheduled on a CPU which is remote to the node
+			 * where the memory is getting freed.
+			 */
+
+			queue_work_node(bdev->pool.nid, bdev->wq, &bo->delayed_delete);
 			return;
 		}
 
diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index 43e27ab77f95..72b81a2ee6c7 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -213,7 +213,8 @@ int ttm_device_init(struct ttm_device *bdev, struct ttm_device_funcs *funcs,
 	bdev->funcs = funcs;
 
 	ttm_sys_man_init(bdev);
-	ttm_pool_init(&bdev->pool, dev, NUMA_NO_NODE, use_dma_alloc, use_dma32);
+
+	ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, use_dma32);
 
 	bdev->vma_manager = vma_manager;
 	spin_lock_init(&bdev->lru_lock);
When a TTM BO is getting freed, to optimize the clearing operation on
the workqueue, schedule it closer to a NUMA node where the memory was
allocated. This avoids the cases where the ttm_bo_delayed_delete gets
scheduled on the CPU cores that are across interconnect boundaries such
as xGMI, PCIe etc.

This change helps USWC GTT allocations on NUMA systems (dGPU) and AMD
APU platforms such as GFXIP9.4.3.

Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c     | 10 +++++++++-
 drivers/gpu/drm/ttm/ttm_device.c |  3 ++-
 2 files changed, 11 insertions(+), 2 deletions(-)
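
For readers unfamiliar with the idea, the node-preference-with-fallback behaviour that the patch comment describes can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the kernel's implementation of queue_work_node(): the struct sketch_topology type, the sketch_pick_cpu() helper and the SKETCH_* constants are all hypothetical names invented here, and the toy topology is just two arrays.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the kernel uses NUMA_NO_NODE and its own
 * "no CPU preference" value, which this sketch does not reproduce. */
#define SKETCH_NO_NODE (-1)
#define SKETCH_ANY_CPU (-1)

/* Toy topology (assumption): cpu_node[i] is the NUMA node of CPU i,
 * cpu_online[i] says whether CPU i may run work. */
struct sketch_topology {
	const int *cpu_node;
	const bool *cpu_online;
	size_t nr_cpus;
};

/* Pick a CPU for a work item that should run close to 'node'.
 * Returns a CPU index, or SKETCH_ANY_CPU when there is no preference
 * or no online CPU on that node (the system-wide fallback). */
int sketch_pick_cpu(const struct sketch_topology *topo, int node,
		    int current_cpu)
{
	assert(topo != NULL);

	/* No node recorded for the device: no placement preference. */
	if (node == SKETCH_NO_NODE)
		return SKETCH_ANY_CPU;

	/* Staying on the current CPU is cheapest if it is already local. */
	if (current_cpu >= 0 && (size_t)current_cpu < topo->nr_cpus &&
	    topo->cpu_online[current_cpu] &&
	    topo->cpu_node[current_cpu] == node)
		return current_cpu;

	/* Otherwise take the first online CPU on the requested node. */
	for (size_t cpu = 0; cpu < topo->nr_cpus; cpu++) {
		if (topo->cpu_online[cpu] && topo->cpu_node[cpu] == node)
			return (int)cpu;
	}

	/* Nothing usable on that node: fall back to any CPU. */
	return SKETCH_ANY_CPU;
}
```

With a four-CPU topology where CPUs 0 and 1 sit on node 0 and CPUs 2 and 3 on node 1, a request for node 1 made from CPU 0 lands on a node 1 CPU, while a request for a node with no online CPUs returns SKETCH_ANY_CPU, which corresponds to the "any CPU core available system wide" case in the patch comment.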