Message ID | 20231108125824.995445-1-rajneesh.bhardwaj@amd.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v2] drm/ttm: Schedule delayed_delete worker closer |
On 08.11.23 at 13:58, Rajneesh Bhardwaj wrote:
> Try to allocate system memory on the NUMA node the device is closest to
> and try to run delayed_delete workers on a CPU of this node as well.
>
> To optimize the memory clearing operation when a TTM BO gets freed by
> the delayed_delete worker, scheduling it closer to a NUMA node where the
> memory was initially allocated helps avoid the cases where the worker
> gets randomly scheduled on the CPU cores that are across interconnect
> boundaries such as xGMI, PCIe etc.
>
> This change helps USWC GTT allocations on NUMA systems (dGPU) and AMD
> APU platforms such as GFXIP9.4.3.
>
> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

Going to push this to drm-misc-next.

Thanks,
Christian.

> ---
>
> Changes in v2:
>  - Absorbed the feedback provided by Christian in the commit message and
>    the comment.
>
>  drivers/gpu/drm/ttm/ttm_bo.c     | 8 +++++++-
>  drivers/gpu/drm/ttm/ttm_device.c | 3 ++-
>  2 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index 5757b9415e37..6f28a77a565b 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -370,7 +370,13 @@ static void ttm_bo_release(struct kref *kref)
>          spin_unlock(&bo->bdev->lru_lock);
>
>          INIT_WORK(&bo->delayed_delete, ttm_bo_delayed_delete);
> -        queue_work(bdev->wq, &bo->delayed_delete);
> +
> +        /* Schedule the worker on the closest NUMA node. This
> +         * improves performance since system memory might be
> +         * cleared on free and that is best done on a CPU core
> +         * close to it.
> +         */
> +        queue_work_node(bdev->pool.nid, bdev->wq, &bo->delayed_delete);
>          return;
>      }
>
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index 43e27ab77f95..72b81a2ee6c7 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -213,7 +213,8 @@ int ttm_device_init(struct ttm_device *bdev, struct ttm_device_funcs *funcs,
>      bdev->funcs = funcs;
>
>      ttm_sys_man_init(bdev);
> -    ttm_pool_init(&bdev->pool, dev, NUMA_NO_NODE, use_dma_alloc, use_dma32);
> +
> +    ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, use_dma32);
>
>      bdev->vma_manager = vma_manager;
>      spin_lock_init(&bdev->lru_lock);
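For context, the patch combines two existing kernel APIs: dev_to_node(), which returns the NUMA node a struct device is attached to (NUMA_NO_NODE when the topology is unknown), and queue_work_node(), which asks the workqueue core to run a work item on a CPU of the given node. Below is a minimal sketch of that pattern outside of TTM; the demo_* names are hypothetical and only illustrate the idea, they are not part of the patch.

```c
#include <linux/device.h>
#include <linux/errno.h>
#include <linux/workqueue.h>

/* Hypothetical per-device context, for illustration only. */
struct demo_ctx {
	struct workqueue_struct *wq;
	struct work_struct cleanup_work;
	int nid;
};

static void demo_cleanup(struct work_struct *work)
{
	/* Clearing/freeing buffers here runs close to the memory when the
	 * work item was queued on the node that backs those buffers.
	 */
}

static int demo_init(struct demo_ctx *ctx, struct device *dev)
{
	/* NUMA node the device sits on; NUMA_NO_NODE if unknown. */
	ctx->nid = dev_to_node(dev);

	/* queue_work_node() expects an unbound workqueue. */
	ctx->wq = alloc_workqueue("demo", WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
	if (!ctx->wq)
		return -ENOMEM;

	INIT_WORK(&ctx->cleanup_work, demo_cleanup);
	return 0;
}

static void demo_schedule_cleanup(struct demo_ctx *ctx)
{
	/* Prefer a CPU on the device's node; NUMA_NO_NODE means "no preference". */
	queue_work_node(ctx->nid, ctx->wq, &ctx->cleanup_work);
}
```

In the patch itself the node id comes from bdev->pool.nid, which ttm_device_init() now seeds with dev_to_node(dev), so the pool's page allocations and the delayed-delete worker share the same locality hint.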
On 11/8/2023 9:49 AM, Christian König wrote:
> On 08.11.23 at 13:58, Rajneesh Bhardwaj wrote:
>> Try to allocate system memory on the NUMA node the device is closest to
>> and try to run delayed_delete workers on a CPU of this node as well.
>>
>> To optimize the memory clearing operation when a TTM BO gets freed by
>> the delayed_delete worker, scheduling it closer to a NUMA node where the
>> memory was initially allocated helps avoid the cases where the worker
>> gets randomly scheduled on the CPU cores that are across interconnect
>> boundaries such as xGMI, PCIe etc.
>>
>> This change helps USWC GTT allocations on NUMA systems (dGPU) and AMD
>> APU platforms such as GFXIP9.4.3.
>>
>> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
>
> Reviewed-by: Christian König <christian.koenig@amd.com>
>
> Going to push this to drm-misc-next.

Thanks Christian, there is a new regression reported and I am checking on
that. Please don't submit it yet.

> Thanks,
> Christian.
On 2023-11-08 09:49, Christian König wrote:
> On 08.11.23 at 13:58, Rajneesh Bhardwaj wrote:
>> Try to allocate system memory on the NUMA node the device is closest to
>> and try to run delayed_delete workers on a CPU of this node as well.
>>
>> To optimize the memory clearing operation when a TTM BO gets freed by
>> the delayed_delete worker, scheduling it closer to a NUMA node where the
>> memory was initially allocated helps avoid the cases where the worker
>> gets randomly scheduled on the CPU cores that are across interconnect
>> boundaries such as xGMI, PCIe etc.
>>
>> This change helps USWC GTT allocations on NUMA systems (dGPU) and AMD
>> APU platforms such as GFXIP9.4.3.
>>
>> Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
>
> Reviewed-by: Christian König <christian.koenig@amd.com>
>
> Going to push this to drm-misc-next.

Hold on. Rajneesh just pointed out a WARN regression from testing. I think
the problem is that the bdev->wq is not unbound.

Regards,
  Felix

> Thanks,
> Christian.
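Felix's hypothesis lines up with the check in the workqueue core: queue_work_node() only implements node-aware CPU selection for unbound workqueues and contains WARN_ON_ONCE(!(wq->flags & WQ_UNBOUND)), so handing it a bound, per-CPU workqueue would trip exactly this kind of WARN. One possible direction, sketched here as an assumption rather than the actual follow-up fix, would be to create the TTM workqueue with WQ_UNBOUND so the node hint can be honored; the helper name and the exact flag/max_active values below are illustrative only.

```c
#include <linux/workqueue.h>

/*
 * Illustrative only: one way the delayed-delete workqueue could be
 * created so that queue_work_node() accepts it without warning.
 * Whether WQ_HIGHPRI and a max_active of 16 match what TTM actually
 * uses upstream is an assumption here, not a statement of fact.
 */
static struct workqueue_struct *ttm_create_unbound_wq(void)
{
	return alloc_workqueue("ttm", WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_HIGHPRI, 16);
}
```

Making the workqueue unbound also changes its scheduling and concurrency behavior, so that trade-off would need its own evaluation alongside the NUMA benefit.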
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index 5757b9415e37..6f28a77a565b 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -370,7 +370,13 @@ static void ttm_bo_release(struct kref *kref)
 		spin_unlock(&bo->bdev->lru_lock);
 
 		INIT_WORK(&bo->delayed_delete, ttm_bo_delayed_delete);
-		queue_work(bdev->wq, &bo->delayed_delete);
+
+		/* Schedule the worker on the closest NUMA node. This
+		 * improves performance since system memory might be
+		 * cleared on free and that is best done on a CPU core
+		 * close to it.
+		 */
+		queue_work_node(bdev->pool.nid, bdev->wq, &bo->delayed_delete);
 		return;
 	}
 
diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index 43e27ab77f95..72b81a2ee6c7 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -213,7 +213,8 @@ int ttm_device_init(struct ttm_device *bdev, struct ttm_device_funcs *funcs,
 	bdev->funcs = funcs;
 
 	ttm_sys_man_init(bdev);
-	ttm_pool_init(&bdev->pool, dev, NUMA_NO_NODE, use_dma_alloc, use_dma32);
+
+	ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, use_dma32);
 
 	bdev->vma_manager = vma_manager;
 	spin_lock_init(&bdev->lru_lock);
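One property worth noting about the second hunk: dev_to_node() returns NUMA_NO_NODE when the device has no recorded node affinity (or on !CONFIG_NUMA builds), and both the page allocator and queue_work_node() treat that value as "no preference", so on such systems the change should degrade to the previous behavior. A small sketch of that fallback path follows; demo_alloc_near_device() is a hypothetical helper that only mirrors how a node id from dev_to_node() is typically consumed, not TTM's exact pool internals.

```c
#include <linux/device.h>
#include <linux/gfp.h>

/*
 * Hypothetical helper: allocate pages near a device. When dev_to_node()
 * cannot report an affinity it returns NUMA_NO_NODE, and
 * alloc_pages_node() then falls back to the local node, i.e. roughly
 * the pre-patch behavior.
 */
static struct page *demo_alloc_near_device(struct device *dev, gfp_t gfp,
					   unsigned int order)
{
	int nid = dev_to_node(dev);	/* NUMA_NO_NODE if unknown */

	return alloc_pages_node(nid, gfp, order);
}
```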