diff mbox series

drm/ttm: Implement strict NUMA pool allocations

Message ID 20240322070753.69324-1-rajneesh.bhardwaj@amd.com (mailing list archive)
State New, archived
Headers show
Series drm/ttm: Implement strict NUMA pool allocations | expand

Commit Message

Rajneesh Bhardwaj March 22, 2024, 7:07 a.m. UTC
This change allows TTM to be flexible to honor NUMA localized
allocations which can result in significant performance improvement on a
multi socket NUMA system. On GFXIP 9.4.3 based AMD APUs, we see
manyfold benefits of this change resulting not only in ~10% performance
improvement in certain benchmarks but also generating more consistent
and less sporadic results specially when the NUMA balancing is not
explecitely disabled. In certain scenarios, workloads show a run-to-run
variability e.g. HPL would show a ~10x performance drop after running
back to back 4-5 times and would recover later on a subsequent run. This
is seen with memory intensive other workloads too. It was seen that when
CPU caches were dropped e.g. sudo sysctl -w vm.drop_caches=1, the
variability reduced but the performance was still well below a good run.

Use of __GFP_THISNODE flag ensures that during memory allocation, kernel
prioritizes memory allocations from the local or closest NUMA node
thereby reducing memory access latency. When memory is allocated using
__GFP_THISNODE flag, memory allocations will predominantly be done on
the local node, consequency, the shrinkers may priotitize reclaiming
memory from caches assocoated with local node to maintain memory
locality and minimize latency, thereby provide better shinker targeting.

Reduced memory pressure on remote nodes, can also indirectly influence
shrinker behavior by potentially reducing the frequency and intensity of
memory reclamation operation on remote nodes and could provide improved
overall system performance.

While this change could be more beneficial in general, i.e., without the
use of a module parameter, but in absence of widespread testing, limit
it to the AMD GFXIP 9.4.3 based ttm pool initializations only.


Cc: Joe Greathouse <joseph.greathouse@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h       |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c   |  8 ++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   |  7 ++++++-
 drivers/gpu/drm/ttm/tests/ttm_pool_test.c | 10 +++++-----
 drivers/gpu/drm/ttm/ttm_device.c          |  2 +-
 drivers/gpu/drm/ttm/ttm_pool.c            |  7 ++++++-
 include/drm/ttm/ttm_pool.h                |  4 +++-
 7 files changed, 30 insertions(+), 9 deletions(-)

Comments

Christian König March 22, 2024, 1:15 p.m. UTC | #1
Am 22.03.24 um 08:07 schrieb Rajneesh Bhardwaj:
> This change allows TTM to be flexible to honor NUMA localized
> allocations which can result in significant performance improvement on a
> multi socket NUMA system. On GFXIP 9.4.3 based AMD APUs, we see
> manyfold benefits of this change resulting not only in ~10% performance
> improvement in certain benchmarks but also generating more consistent
> and less sporadic results specially when the NUMA balancing is not
> explecitely disabled. In certain scenarios, workloads show a run-to-run
> variability e.g. HPL would show a ~10x performance drop after running
> back to back 4-5 times and would recover later on a subsequent run. This
> is seen with memory intensive other workloads too. It was seen that when
> CPU caches were dropped e.g. sudo sysctl -w vm.drop_caches=1, the
> variability reduced but the performance was still well below a good run.
>
> Use of __GFP_THISNODE flag ensures that during memory allocation, kernel
> prioritizes memory allocations from the local or closest NUMA node
> thereby reducing memory access latency.

Exactly that's what it doesn't do.

__GFP_THISNODE just means it enforces allocation from the specified node.

Additional to that there is a mandatory requirement that this flag is 
only used if it has to be used for correctness. And that is simply not 
the case here.

So as long as nobody can explain why that should help this is an 
absolutely no-go.

Regards,
Christian.

>   When memory is allocated using
> __GFP_THISNODE flag, memory allocations will predominantly be done on
> the local node, consequency, the shrinkers may priotitize reclaiming
> memory from caches assocoated with local node to maintain memory
> locality and minimize latency, thereby provide better shinker targeting.
>
> Reduced memory pressure on remote nodes, can also indirectly influence
> shrinker behavior by potentially reducing the frequency and intensity of
> memory reclamation operation on remote nodes and could provide improved
> overall system performance.
>
> While this change could be more beneficial in general, i.e., without the
> use of a module parameter, but in absence of widespread testing, limit
> it to the AMD GFXIP 9.4.3 based ttm pool initializations only.
>
>
> Cc: Joe Greathouse <joseph.greathouse@amd.com>
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h       |  1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c   |  8 ++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   |  7 ++++++-
>   drivers/gpu/drm/ttm/tests/ttm_pool_test.c | 10 +++++-----
>   drivers/gpu/drm/ttm/ttm_device.c          |  2 +-
>   drivers/gpu/drm/ttm/ttm_pool.c            |  7 ++++++-
>   include/drm/ttm/ttm_pool.h                |  4 +++-
>   7 files changed, 30 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 9c62552bec34..96532cfc6230 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -253,6 +253,7 @@ extern int amdgpu_user_partt_mode;
>   extern int amdgpu_agp;
>   
>   extern int amdgpu_wbrf;
> +extern bool strict_numa_alloc;
>   
>   #define AMDGPU_VM_MAX_NUM_CTX			4096
>   #define AMDGPU_SG_THRESHOLD			(256*1024*1024)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 80b9642f2bc4..a183a6b4493d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -781,6 +781,14 @@ int queue_preemption_timeout_ms = 9000;
>   module_param(queue_preemption_timeout_ms, int, 0644);
>   MODULE_PARM_DESC(queue_preemption_timeout_ms, "queue preemption timeout in ms (1 = Minimum, 9000 = default)");
>   
> +/**
> + * DOC: strict_numa_alloc(bool)
> + * Policy to force NUMA allocation requests from the proximity NUMA domain only.
> + */
> +bool strict_numa_alloc;
> +module_param(strict_numa_alloc, bool, 0444);
> +MODULE_PARM_DESC(strict_numa_alloc, "Force NUMA allocation requests to be satisfied from the closest node only (false = default)");
> +
>   /**
>    * DOC: debug_evictions(bool)
>    * Enable extra debug messages to help determine the cause of evictions
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index b0ed10f4de60..a9f78f85e28c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -1768,6 +1768,7 @@ static int amdgpu_ttm_reserve_tmr(struct amdgpu_device *adev)
>   
>   static int amdgpu_ttm_pools_init(struct amdgpu_device *adev)
>   {
> +	bool policy = true;
>   	int i;
>   
>   	if (!adev->gmc.is_app_apu || !adev->gmc.num_mem_partitions)
> @@ -1779,11 +1780,15 @@ static int amdgpu_ttm_pools_init(struct amdgpu_device *adev)
>   	if (!adev->mman.ttm_pools)
>   		return -ENOMEM;
>   
> +	/* Policy not only depends on the module param but also on the ASIC
> +	 * setting use_strict_numa_alloc as well.
> +	 */
>   	for (i = 0; i < adev->gmc.num_mem_partitions; i++) {
>   		ttm_pool_init(&adev->mman.ttm_pools[i], adev->dev,
>   			      adev->gmc.mem_partitions[i].numa.node,
> -			      false, false);
> +			      false, false, policy && strict_numa_alloc);
>   	}
> +
>   	return 0;
>   }
>   
> diff --git a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
> index 2d9cae8cd984..6ff47aac570a 100644
> --- a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
> +++ b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
> @@ -87,7 +87,7 @@ static struct ttm_pool *ttm_pool_pre_populated(struct kunit *test,
>   	pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
>   	KUNIT_ASSERT_NOT_NULL(test, pool);
>   
> -	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
> +	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false, false);
>   
>   	err = ttm_pool_alloc(pool, tt, &simple_ctx);
>   	KUNIT_ASSERT_EQ(test, err, 0);
> @@ -152,7 +152,7 @@ static void ttm_pool_alloc_basic(struct kunit *test)
>   	KUNIT_ASSERT_NOT_NULL(test, pool);
>   
>   	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, params->use_dma_alloc,
> -		      false);
> +		      false, false);
>   
>   	KUNIT_ASSERT_PTR_EQ(test, pool->dev, devs->dev);
>   	KUNIT_ASSERT_EQ(test, pool->nid, NUMA_NO_NODE);
> @@ -219,7 +219,7 @@ static void ttm_pool_alloc_basic_dma_addr(struct kunit *test)
>   	pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
>   	KUNIT_ASSERT_NOT_NULL(test, pool);
>   
> -	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
> +	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false, false);
>   
>   	err = ttm_pool_alloc(pool, tt, &simple_ctx);
>   	KUNIT_ASSERT_EQ(test, err, 0);
> @@ -349,7 +349,7 @@ static void ttm_pool_free_dma_alloc(struct kunit *test)
>   	pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
>   	KUNIT_ASSERT_NOT_NULL(test, pool);
>   
> -	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
> +	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false, false);
>   	ttm_pool_alloc(pool, tt, &simple_ctx);
>   
>   	pt = &pool->caching[caching].orders[order];
> @@ -380,7 +380,7 @@ static void ttm_pool_free_no_dma_alloc(struct kunit *test)
>   	pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
>   	KUNIT_ASSERT_NOT_NULL(test, pool);
>   
> -	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, false, false);
> +	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, false, false, false);
>   	ttm_pool_alloc(pool, tt, &simple_ctx);
>   
>   	pt = &pool->caching[caching].orders[order];
> diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
> index f5187b384ae9..540e8a44f015 100644
> --- a/drivers/gpu/drm/ttm/ttm_device.c
> +++ b/drivers/gpu/drm/ttm/ttm_device.c
> @@ -215,7 +215,7 @@ int ttm_device_init(struct ttm_device *bdev, const struct ttm_device_funcs *func
>   
>   	ttm_sys_man_init(bdev);
>   
> -	ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, use_dma32);
> +	ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, use_dma32, false);
>   
>   	bdev->vma_manager = vma_manager;
>   	spin_lock_init(&bdev->lru_lock);
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index dbc96984d331..73aafd06c361 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -447,6 +447,9 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>   	else
>   		gfp_flags |= GFP_HIGHUSER;
>   
> +	if (pool->use_strict_numa_alloc)
> +		gfp_flags |= __GFP_THISNODE;
> +
>   	for (order = min_t(unsigned int, MAX_ORDER, __fls(num_pages));
>   	     num_pages;
>   	     order = min_t(unsigned int, order, __fls(num_pages))) {
> @@ -555,7 +558,8 @@ EXPORT_SYMBOL(ttm_pool_free);
>    * Initialize the pool and its pool types.
>    */
>   void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
> -		   int nid, bool use_dma_alloc, bool use_dma32)
> +		   int nid, bool use_dma_alloc, bool use_dma32,
> +		   bool use_strict_numa_alloc)
>   {
>   	unsigned int i, j;
>   
> @@ -565,6 +569,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>   	pool->nid = nid;
>   	pool->use_dma_alloc = use_dma_alloc;
>   	pool->use_dma32 = use_dma32;
> +	pool->use_strict_numa_alloc = use_strict_numa_alloc;
>   
>   	if (use_dma_alloc || nid != NUMA_NO_NODE) {
>   		for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
> diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
> index 30a347e5aa11..6b7bdc952466 100644
> --- a/include/drm/ttm/ttm_pool.h
> +++ b/include/drm/ttm/ttm_pool.h
> @@ -72,6 +72,7 @@ struct ttm_pool {
>   
>   	bool use_dma_alloc;
>   	bool use_dma32;
> +	bool use_strict_numa_alloc;
>   
>   	struct {
>   		struct ttm_pool_type orders[MAX_ORDER + 1];
> @@ -83,7 +84,8 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
>   void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt);
>   
>   void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
> -		   int nid, bool use_dma_alloc, bool use_dma32);
> +		   int nid, bool use_dma_alloc, bool use_dma32,
> +		   bool use_strict_numa_alloc);
>   void ttm_pool_fini(struct ttm_pool *pool);
>   
>   int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m);
Michael J. Ruhl March 22, 2024, 3:29 p.m. UTC | #2
>-----Original Message-----
>From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
>Rajneesh Bhardwaj
>Sent: Friday, March 22, 2024 3:08 AM
>To: amd-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org
>Cc: felix.kuehling@amd.com; alexander.deucher@amd.com;
>christian.koenig@amd.com; Rajneesh Bhardwaj
><rajneesh.bhardwaj@amd.com>; Joe Greathouse
><joseph.greathouse@amd.com>
>Subject: [PATCH] drm/ttm: Implement strict NUMA pool allocations
>
>This change allows TTM to be flexible to honor NUMA localized
>allocations which can result in significant performance improvement on a
>multi socket NUMA system. On GFXIP 9.4.3 based AMD APUs, we see
>manyfold benefits of this change resulting not only in ~10% performance
>improvement in certain benchmarks but also generating more consistent
>and less sporadic results specially when the NUMA balancing is not
>explecitely disabled. In certain scenarios, workloads show a run-to-run
>variability e.g. HPL would show a ~10x performance drop after running
>back to back 4-5 times and would recover later on a subsequent run. This
>is seen with memory intensive other workloads too. It was seen that when
>CPU caches were dropped e.g. sudo sysctl -w vm.drop_caches=1, the
>variability reduced but the performance was still well below a good run.
>
>Use of __GFP_THISNODE flag ensures that during memory allocation, kernel
>prioritizes memory allocations from the local or closest NUMA node
>thereby reducing memory access latency. When memory is allocated using
>__GFP_THISNODE flag, memory allocations will predominantly be done on
>the local node, consequency, the shrinkers may priotitize reclaiming
>memory from caches assocoated with local node to maintain memory
>locality and minimize latency, thereby provide better shinker targeting.
>
>Reduced memory pressure on remote nodes, can also indirectly influence
>shrinker behavior by potentially reducing the frequency and intensity of
>memory reclamation operation on remote nodes and could provide improved
>overall system performance.
>
>While this change could be more beneficial in general, i.e., without the
>use of a module parameter, but in absence of widespread testing, limit
>it to the AMD GFXIP 9.4.3 based ttm pool initializations only.
>
>
>Cc: Joe Greathouse <joseph.greathouse@amd.com>
>Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
>---
> drivers/gpu/drm/amd/amdgpu/amdgpu.h       |  1 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c   |  8 ++++++++
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   |  7 ++++++-
> drivers/gpu/drm/ttm/tests/ttm_pool_test.c | 10 +++++-----
> drivers/gpu/drm/ttm/ttm_device.c          |  2 +-
> drivers/gpu/drm/ttm/ttm_pool.c            |  7 ++++++-
> include/drm/ttm/ttm_pool.h                |  4 +++-
> 7 files changed, 30 insertions(+), 9 deletions(-)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>index 9c62552bec34..96532cfc6230 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>@@ -253,6 +253,7 @@ extern int amdgpu_user_partt_mode;
> extern int amdgpu_agp;
>
> extern int amdgpu_wbrf;
>+extern bool strict_numa_alloc;
>
> #define AMDGPU_VM_MAX_NUM_CTX			4096
> #define AMDGPU_SG_THRESHOLD			(256*1024*1024)
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>index 80b9642f2bc4..a183a6b4493d 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>@@ -781,6 +781,14 @@ int queue_preemption_timeout_ms = 9000;
> module_param(queue_preemption_timeout_ms, int, 0644);
> MODULE_PARM_DESC(queue_preemption_timeout_ms, "queue preemption
>timeout in ms (1 = Minimum, 9000 = default)");
>
>+/**
>+ * DOC: strict_numa_alloc(bool)
>+ * Policy to force NUMA allocation requests from the proximity NUMA domain
>only.
>+ */
>+bool strict_numa_alloc;
>+module_param(strict_numa_alloc, bool, 0444);
>+MODULE_PARM_DESC(strict_numa_alloc, "Force NUMA allocation requests
>to be satisfied from the closest node only (false = default)");
>+
> /**
>  * DOC: debug_evictions(bool)
>  * Enable extra debug messages to help determine the cause of evictions
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>index b0ed10f4de60..a9f78f85e28c 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>@@ -1768,6 +1768,7 @@ static int amdgpu_ttm_reserve_tmr(struct
>amdgpu_device *adev)
>
> static int amdgpu_ttm_pools_init(struct amdgpu_device *adev)
> {
>+	bool policy = true;
> 	int i;
>
> 	if (!adev->gmc.is_app_apu || !adev->gmc.num_mem_partitions)
>@@ -1779,11 +1780,15 @@ static int amdgpu_ttm_pools_init(struct
>amdgpu_device *adev)
> 	if (!adev->mman.ttm_pools)
> 		return -ENOMEM;
>
>+	/* Policy not only depends on the module param but also on the ASIC
>+	 * setting use_strict_numa_alloc as well.
>+	 */
> 	for (i = 0; i < adev->gmc.num_mem_partitions; i++) {
> 		ttm_pool_init(&adev->mman.ttm_pools[i], adev->dev,
> 			      adev->gmc.mem_partitions[i].numa.node,
>-			      false, false);
>+			      false, false, policy && strict_numa_alloc);

why not just 'strict_numa_alloc'?

Is 'policy' used somewhere else?  Not sure this adds clarity...

> 	}
>+
> 	return 0;
> }
>
>diff --git a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
>b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
>index 2d9cae8cd984..6ff47aac570a 100644
>--- a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
>+++ b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
>@@ -87,7 +87,7 @@ static struct ttm_pool *ttm_pool_pre_populated(struct
>kunit *test,
> 	pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
> 	KUNIT_ASSERT_NOT_NULL(test, pool);
>
>-	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
>+	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false, false);

Is this really an acceptable interface (three non-named Booleans in a row)?

I have no idea what "true, false, false" means.

Would having a "flags" parameter and then

USE_DMA_ALLOC 	BIT(0)
USE_DMA32		BIT(1)
USE_STRICT_NUMA	BIT(2)

Make this more readable?

Or define your Booleans:

USE_DMA_ALLOC 	true
USE_DMA32		true
USE_STRICT_NUMA	true

NO_DMA_ALLOC	false
NO_DMA32		false
NO_STRICT_NUMA	false

So at a minimum, we might know what these parameters are?

What is the relationship between this feature and the nid value?

Is this value used for the allocations?

If this is not NUMA_NO_NODE, would this do the same thing?
(or is the STRICT flag the only way?)

Just some thoughts,

Mike

>
> 	err = ttm_pool_alloc(pool, tt, &simple_ctx);
> 	KUNIT_ASSERT_EQ(test, err, 0);
>@@ -152,7 +152,7 @@ static void ttm_pool_alloc_basic(struct kunit *test)
> 	KUNIT_ASSERT_NOT_NULL(test, pool);
>
> 	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, params-
>>use_dma_alloc,
>-		      false);
>+		      false, false);
>
> 	KUNIT_ASSERT_PTR_EQ(test, pool->dev, devs->dev);
> 	KUNIT_ASSERT_EQ(test, pool->nid, NUMA_NO_NODE);
>@@ -219,7 +219,7 @@ static void ttm_pool_alloc_basic_dma_addr(struct
>kunit *test)
> 	pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
> 	KUNIT_ASSERT_NOT_NULL(test, pool);
>
>-	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
>+	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false, false);
>
> 	err = ttm_pool_alloc(pool, tt, &simple_ctx);
> 	KUNIT_ASSERT_EQ(test, err, 0);
>@@ -349,7 +349,7 @@ static void ttm_pool_free_dma_alloc(struct kunit
>*test)
> 	pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
> 	KUNIT_ASSERT_NOT_NULL(test, pool);
>
>-	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
>+	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false, false);
> 	ttm_pool_alloc(pool, tt, &simple_ctx);
>
> 	pt = &pool->caching[caching].orders[order];
>@@ -380,7 +380,7 @@ static void ttm_pool_free_no_dma_alloc(struct kunit
>*test)
> 	pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
> 	KUNIT_ASSERT_NOT_NULL(test, pool);
>
>-	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, false, false);
>+	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, false, false, false);
> 	ttm_pool_alloc(pool, tt, &simple_ctx);
>
> 	pt = &pool->caching[caching].orders[order];
>diff --git a/drivers/gpu/drm/ttm/ttm_device.c
>b/drivers/gpu/drm/ttm/ttm_device.c
>index f5187b384ae9..540e8a44f015 100644
>--- a/drivers/gpu/drm/ttm/ttm_device.c
>+++ b/drivers/gpu/drm/ttm/ttm_device.c
>@@ -215,7 +215,7 @@ int ttm_device_init(struct ttm_device *bdev, const
>struct ttm_device_funcs *func
>
> 	ttm_sys_man_init(bdev);
>
>-	ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc,
>use_dma32);
>+	ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc,
>use_dma32, false);
>
> 	bdev->vma_manager = vma_manager;
> 	spin_lock_init(&bdev->lru_lock);
>diff --git a/drivers/gpu/drm/ttm/ttm_pool.c
>b/drivers/gpu/drm/ttm/ttm_pool.c
>index dbc96984d331..73aafd06c361 100644
>--- a/drivers/gpu/drm/ttm/ttm_pool.c
>+++ b/drivers/gpu/drm/ttm/ttm_pool.c
>@@ -447,6 +447,9 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct
>ttm_tt *tt,
> 	else
> 		gfp_flags |= GFP_HIGHUSER;
>
>+	if (pool->use_strict_numa_alloc)
>+		gfp_flags |= __GFP_THISNODE;
>+
> 	for (order = min_t(unsigned int, MAX_ORDER, __fls(num_pages));
> 	     num_pages;
> 	     order = min_t(unsigned int, order, __fls(num_pages))) {
>@@ -555,7 +558,8 @@ EXPORT_SYMBOL(ttm_pool_free);
>  * Initialize the pool and its pool types.
>  */
> void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>-		   int nid, bool use_dma_alloc, bool use_dma32)
>+		   int nid, bool use_dma_alloc, bool use_dma32,
>+		   bool use_strict_numa_alloc)
> {
> 	unsigned int i, j;
>
>@@ -565,6 +569,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct
>device *dev,
> 	pool->nid = nid;
> 	pool->use_dma_alloc = use_dma_alloc;
> 	pool->use_dma32 = use_dma32;
>+	pool->use_strict_numa_alloc = use_strict_numa_alloc;
>
> 	if (use_dma_alloc || nid != NUMA_NO_NODE) {
> 		for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
>diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
>index 30a347e5aa11..6b7bdc952466 100644
>--- a/include/drm/ttm/ttm_pool.h
>+++ b/include/drm/ttm/ttm_pool.h
>@@ -72,6 +72,7 @@ struct ttm_pool {
>
> 	bool use_dma_alloc;
> 	bool use_dma32;
>+	bool use_strict_numa_alloc;
>
> 	struct {
> 		struct ttm_pool_type orders[MAX_ORDER + 1];
>@@ -83,7 +84,8 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt
>*tt,
> void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt);
>
> void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>-		   int nid, bool use_dma_alloc, bool use_dma32);
>+		   int nid, bool use_dma_alloc, bool use_dma32,
>+		   bool use_strict_numa_alloc);
> void ttm_pool_fini(struct ttm_pool *pool);
>
> int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m);
>--
>2.34.1
Rajneesh Bhardwaj March 23, 2024, 1:59 a.m. UTC | #3
On 3/22/2024 9:15 AM, Christian König wrote:
> Am 22.03.24 um 08:07 schrieb Rajneesh Bhardwaj:
>> This change allows TTM to be flexible to honor NUMA localized
>> allocations which can result in significant performance improvement on a
>> multi socket NUMA system. On GFXIP 9.4.3 based AMD APUs, we see
>> manyfold benefits of this change resulting not only in ~10% performance
>> improvement in certain benchmarks but also generating more consistent
>> and less sporadic results specially when the NUMA balancing is not
>> explecitely disabled. In certain scenarios, workloads show a run-to-run
>> variability e.g. HPL would show a ~10x performance drop after running
>> back to back 4-5 times and would recover later on a subsequent run. This
>> is seen with memory intensive other workloads too. It was seen that when
>> CPU caches were dropped e.g. sudo sysctl -w vm.drop_caches=1, the
>> variability reduced but the performance was still well below a good run.
>>
>> Use of __GFP_THISNODE flag ensures that during memory allocation, kernel
>> prioritizes memory allocations from the local or closest NUMA node
>> thereby reducing memory access latency.
>
> Exactly that's what it doesn't do.
>
> __GFP_THISNODE just means it enforces allocation from the specified node.

Thanks for the feedback, Christian.

Sure, maybe I should have made it clear in the commit message that there 
is no fallback to zonelist while allocating slab cache. And this is 
exactly what we want, we don't want the pages to be landing on remote nodes.

>
> Additional to that there is a mandatory requirement that this flag is 
> only used if it has to be used for correctness. And that is simply not 
> the case here.


Sorry ,I didn't quite understand this. What is the mandatory 
requirement, could you please clarify this?  Based on the documentation 
I read about this and after reading the kernel source code, I didn't 
find any hard requirement to discourage use of this.


>
> So as long as nobody can explain why that should help this is an 
> absolutely no-go.


Could you please clarify the controversial part here? We have absolutely 
strong backing data the proves the usefulness besides this change 
restricts us to a certain HW IP. Also, the possibilty of OOM killer 
seems to less actually when we use __THISNODE.

https://elixir.bootlin.com/linux/latest/source/mm/page_alloc.c#L3439


Another important thing I want to highlight here is that for AMD 
GFXIP9.4.3 APU, which has unified HBM stack (no RAM/VRAM distinction), 
we get the backing vram pages using ttm_pool_alloc_page(nid) already.



>
> Regards,
> Christian.
>
>>   When memory is allocated using
>> __GFP_THISNODE flag, memory allocations will predominantly be done on
>> the local node, consequency, the shrinkers may priotitize reclaiming
>> memory from caches assocoated with local node to maintain memory
>> locality and minimize latency, thereby provide better shinker targeting.
>>
>> Reduced memory pressure on remote nodes, can also indirectly influence
>> shrinker behavior by potentially reducing the frequency and intensity of
>> memory reclamation operation on remote nodes and could provide improved
>> overall system performance.
>>
>> While this change could be more beneficial in general, i.e., without the
>> use of a module parameter, but in absence of widespread testing, limit
>> it to the AMD GFXIP 9.4.3 based ttm pool initializations only.
>>
>>
>> Cc: Joe Greathouse <joseph.greathouse@amd.com>
>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu.h       |  1 +
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c   |  8 ++++++++
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   |  7 ++++++-
>>   drivers/gpu/drm/ttm/tests/ttm_pool_test.c | 10 +++++-----
>>   drivers/gpu/drm/ttm/ttm_device.c          |  2 +-
>>   drivers/gpu/drm/ttm/ttm_pool.c            |  7 ++++++-
>>   include/drm/ttm/ttm_pool.h                |  4 +++-
>>   7 files changed, 30 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 9c62552bec34..96532cfc6230 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -253,6 +253,7 @@ extern int amdgpu_user_partt_mode;
>>   extern int amdgpu_agp;
>>     extern int amdgpu_wbrf;
>> +extern bool strict_numa_alloc;
>>     #define AMDGPU_VM_MAX_NUM_CTX            4096
>>   #define AMDGPU_SG_THRESHOLD            (256*1024*1024)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index 80b9642f2bc4..a183a6b4493d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -781,6 +781,14 @@ int queue_preemption_timeout_ms = 9000;
>>   module_param(queue_preemption_timeout_ms, int, 0644);
>>   MODULE_PARM_DESC(queue_preemption_timeout_ms, "queue preemption 
>> timeout in ms (1 = Minimum, 9000 = default)");
>>   +/**
>> + * DOC: strict_numa_alloc(bool)
>> + * Policy to force NUMA allocation requests from the proximity NUMA 
>> domain only.
>> + */
>> +bool strict_numa_alloc;
>> +module_param(strict_numa_alloc, bool, 0444);
>> +MODULE_PARM_DESC(strict_numa_alloc, "Force NUMA allocation requests 
>> to be satisfied from the closest node only (false = default)");
>> +
>>   /**
>>    * DOC: debug_evictions(bool)
>>    * Enable extra debug messages to help determine the cause of 
>> evictions
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index b0ed10f4de60..a9f78f85e28c 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -1768,6 +1768,7 @@ static int amdgpu_ttm_reserve_tmr(struct 
>> amdgpu_device *adev)
>>     static int amdgpu_ttm_pools_init(struct amdgpu_device *adev)
>>   {
>> +    bool policy = true;
>>       int i;
>>         if (!adev->gmc.is_app_apu || !adev->gmc.num_mem_partitions)
>> @@ -1779,11 +1780,15 @@ static int amdgpu_ttm_pools_init(struct 
>> amdgpu_device *adev)
>>       if (!adev->mman.ttm_pools)
>>           return -ENOMEM;
>>   +    /* Policy not only depends on the module param but also on the 
>> ASIC
>> +     * setting use_strict_numa_alloc as well.
>> +     */
>>       for (i = 0; i < adev->gmc.num_mem_partitions; i++) {
>>           ttm_pool_init(&adev->mman.ttm_pools[i], adev->dev,
>>                     adev->gmc.mem_partitions[i].numa.node,
>> -                  false, false);
>> +                  false, false, policy && strict_numa_alloc);
>>       }
>> +
>>       return 0;
>>   }
>>   diff --git a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c 
>> b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
>> index 2d9cae8cd984..6ff47aac570a 100644
>> --- a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
>> +++ b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
>> @@ -87,7 +87,7 @@ static struct ttm_pool 
>> *ttm_pool_pre_populated(struct kunit *test,
>>       pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
>>       KUNIT_ASSERT_NOT_NULL(test, pool);
>>   -    ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
>> +    ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false, false);
>>         err = ttm_pool_alloc(pool, tt, &simple_ctx);
>>       KUNIT_ASSERT_EQ(test, err, 0);
>> @@ -152,7 +152,7 @@ static void ttm_pool_alloc_basic(struct kunit *test)
>>       KUNIT_ASSERT_NOT_NULL(test, pool);
>>         ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, 
>> params->use_dma_alloc,
>> -              false);
>> +              false, false);
>>         KUNIT_ASSERT_PTR_EQ(test, pool->dev, devs->dev);
>>       KUNIT_ASSERT_EQ(test, pool->nid, NUMA_NO_NODE);
>> @@ -219,7 +219,7 @@ static void ttm_pool_alloc_basic_dma_addr(struct 
>> kunit *test)
>>       pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
>>       KUNIT_ASSERT_NOT_NULL(test, pool);
>>   -    ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
>> +    ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false, false);
>>         err = ttm_pool_alloc(pool, tt, &simple_ctx);
>>       KUNIT_ASSERT_EQ(test, err, 0);
>> @@ -349,7 +349,7 @@ static void ttm_pool_free_dma_alloc(struct kunit 
>> *test)
>>       pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
>>       KUNIT_ASSERT_NOT_NULL(test, pool);
>>   -    ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
>> +    ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false, false);
>>       ttm_pool_alloc(pool, tt, &simple_ctx);
>>         pt = &pool->caching[caching].orders[order];
>> @@ -380,7 +380,7 @@ static void ttm_pool_free_no_dma_alloc(struct 
>> kunit *test)
>>       pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
>>       KUNIT_ASSERT_NOT_NULL(test, pool);
>>   -    ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, false, false);
>> +    ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, false, false, false);
>>       ttm_pool_alloc(pool, tt, &simple_ctx);
>>         pt = &pool->caching[caching].orders[order];
>> diff --git a/drivers/gpu/drm/ttm/ttm_device.c 
>> b/drivers/gpu/drm/ttm/ttm_device.c
>> index f5187b384ae9..540e8a44f015 100644
>> --- a/drivers/gpu/drm/ttm/ttm_device.c
>> +++ b/drivers/gpu/drm/ttm/ttm_device.c
>> @@ -215,7 +215,7 @@ int ttm_device_init(struct ttm_device *bdev, 
>> const struct ttm_device_funcs *func
>>         ttm_sys_man_init(bdev);
>>   -    ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), 
>> use_dma_alloc, use_dma32);
>> +    ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, 
>> use_dma32, false);
>>         bdev->vma_manager = vma_manager;
>>       spin_lock_init(&bdev->lru_lock);
>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c 
>> b/drivers/gpu/drm/ttm/ttm_pool.c
>> index dbc96984d331..73aafd06c361 100644
>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
>> @@ -447,6 +447,9 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct 
>> ttm_tt *tt,
>>       else
>>           gfp_flags |= GFP_HIGHUSER;
>>   +    if (pool->use_strict_numa_alloc)
>> +        gfp_flags |= __GFP_THISNODE;
>> +
>>       for (order = min_t(unsigned int, MAX_ORDER, __fls(num_pages));
>>            num_pages;
>>            order = min_t(unsigned int, order, __fls(num_pages))) {
>> @@ -555,7 +558,8 @@ EXPORT_SYMBOL(ttm_pool_free);
>>    * Initialize the pool and its pool types.
>>    */
>>   void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>> -           int nid, bool use_dma_alloc, bool use_dma32)
>> +           int nid, bool use_dma_alloc, bool use_dma32,
>> +           bool use_strict_numa_alloc)
>>   {
>>       unsigned int i, j;
>>   @@ -565,6 +569,7 @@ void ttm_pool_init(struct ttm_pool *pool, 
>> struct device *dev,
>>       pool->nid = nid;
>>       pool->use_dma_alloc = use_dma_alloc;
>>       pool->use_dma32 = use_dma32;
>> +    pool->use_strict_numa_alloc = use_strict_numa_alloc;
>>         if (use_dma_alloc || nid != NUMA_NO_NODE) {
>>           for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
>> diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
>> index 30a347e5aa11..6b7bdc952466 100644
>> --- a/include/drm/ttm/ttm_pool.h
>> +++ b/include/drm/ttm/ttm_pool.h
>> @@ -72,6 +72,7 @@ struct ttm_pool {
>>         bool use_dma_alloc;
>>       bool use_dma32;
>> +    bool use_strict_numa_alloc;
>>         struct {
>>           struct ttm_pool_type orders[MAX_ORDER + 1];
>> @@ -83,7 +84,8 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct 
>> ttm_tt *tt,
>>   void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt);
>>     void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>> -           int nid, bool use_dma_alloc, bool use_dma32);
>> +           int nid, bool use_dma_alloc, bool use_dma32,
>> +           bool use_strict_numa_alloc);
>>   void ttm_pool_fini(struct ttm_pool *pool);
>>     int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m);
>
Rajneesh Bhardwaj March 23, 2024, 2:02 a.m. UTC | #4
On 3/22/2024 11:29 AM, Ruhl, Michael J wrote:
>> -----Original Message-----
>> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of
>> Rajneesh Bhardwaj
>> Sent: Friday, March 22, 2024 3:08 AM
>> To: amd-gfx@lists.freedesktop.org; dri-devel@lists.freedesktop.org
>> Cc: felix.kuehling@amd.com; alexander.deucher@amd.com;
>> christian.koenig@amd.com; Rajneesh Bhardwaj
>> <rajneesh.bhardwaj@amd.com>; Joe Greathouse
>> <joseph.greathouse@amd.com>
>> Subject: [PATCH] drm/ttm: Implement strict NUMA pool allocations
>>
>> This change allows TTM to be flexible to honor NUMA localized
>> allocations which can result in significant performance improvement on a
>> multi socket NUMA system. On GFXIP 9.4.3 based AMD APUs, we see
>> manyfold benefits of this change resulting not only in ~10% performance
>> improvement in certain benchmarks but also generating more consistent
>> and less sporadic results specially when the NUMA balancing is not
>> explecitely disabled. In certain scenarios, workloads show a run-to-run
>> variability e.g. HPL would show a ~10x performance drop after running
>> back to back 4-5 times and would recover later on a subsequent run. This
>> is seen with memory intensive other workloads too. It was seen that when
>> CPU caches were dropped e.g. sudo sysctl -w vm.drop_caches=1, the
>> variability reduced but the performance was still well below a good run.
>>
>> Use of __GFP_THISNODE flag ensures that during memory allocation, kernel
>> prioritizes memory allocations from the local or closest NUMA node
>> thereby reducing memory access latency. When memory is allocated using
>> __GFP_THISNODE flag, memory allocations will predominantly be done on
>> the local node, consequency, the shrinkers may priotitize reclaiming
>> memory from caches assocoated with local node to maintain memory
>> locality and minimize latency, thereby provide better shinker targeting.
>>
>> Reduced memory pressure on remote nodes, can also indirectly influence
>> shrinker behavior by potentially reducing the frequency and intensity of
>> memory reclamation operation on remote nodes and could provide improved
>> overall system performance.
>>
>> While this change could be more beneficial in general, i.e., without the
>> use of a module parameter, but in absence of widespread testing, limit
>> it to the AMD GFXIP 9.4.3 based ttm pool initializations only.
>>
>>
>> Cc: Joe Greathouse <joseph.greathouse@amd.com>
>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu.h       |  1 +
>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c   |  8 ++++++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c   |  7 ++++++-
>> drivers/gpu/drm/ttm/tests/ttm_pool_test.c | 10 +++++-----
>> drivers/gpu/drm/ttm/ttm_device.c          |  2 +-
>> drivers/gpu/drm/ttm/ttm_pool.c            |  7 ++++++-
>> include/drm/ttm/ttm_pool.h                |  4 +++-
>> 7 files changed, 30 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 9c62552bec34..96532cfc6230 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -253,6 +253,7 @@ extern int amdgpu_user_partt_mode;
>> extern int amdgpu_agp;
>>
>> extern int amdgpu_wbrf;
>> +extern bool strict_numa_alloc;
>>
>> #define AMDGPU_VM_MAX_NUM_CTX			4096
>> #define AMDGPU_SG_THRESHOLD			(256*1024*1024)
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index 80b9642f2bc4..a183a6b4493d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -781,6 +781,14 @@ int queue_preemption_timeout_ms = 9000;
>> module_param(queue_preemption_timeout_ms, int, 0644);
>> MODULE_PARM_DESC(queue_preemption_timeout_ms, "queue preemption
>> timeout in ms (1 = Minimum, 9000 = default)");
>>
>> +/**
>> + * DOC: strict_numa_alloc(bool)
>> + * Policy to force NUMA allocation requests from the proximity NUMA domain
>> only.
>> + */
>> +bool strict_numa_alloc;
>> +module_param(strict_numa_alloc, bool, 0444);
>> +MODULE_PARM_DESC(strict_numa_alloc, "Force NUMA allocation requests
>> to be satisfied from the closest node only (false = default)");
>> +
>> /**
>>   * DOC: debug_evictions(bool)
>>   * Enable extra debug messages to help determine the cause of evictions
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index b0ed10f4de60..a9f78f85e28c 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -1768,6 +1768,7 @@ static int amdgpu_ttm_reserve_tmr(struct
>> amdgpu_device *adev)
>>
>> static int amdgpu_ttm_pools_init(struct amdgpu_device *adev)
>> {
>> +	bool policy = true;
>> 	int i;
>>
>> 	if (!adev->gmc.is_app_apu || !adev->gmc.num_mem_partitions)
>> @@ -1779,11 +1780,15 @@ static int amdgpu_ttm_pools_init(struct
>> amdgpu_device *adev)
>> 	if (!adev->mman.ttm_pools)
>> 		return -ENOMEM;
>>
>> +	/* Policy not only depends on the module param but also on the ASIC
>> +	 * setting use_strict_numa_alloc as well.
>> +	 */
>> 	for (i = 0; i < adev->gmc.num_mem_partitions; i++) {
>> 		ttm_pool_init(&adev->mman.ttm_pools[i], adev->dev,
>> 			      adev->gmc.mem_partitions[i].numa.node,
>> -			      false, false);
>> +			      false, false, policy && strict_numa_alloc);
> why not just 'strict_numa_alloc'?
>
> Is 'policy' used somewhere else?  Not sure this adds clarity...


Thanks Mike for the feedback. I could remove this, just wasn't sure 
about the initial reaction, so just kept the switch off by default so 
keep minimum impact outside of where we want this.


>
>> 	}
>> +
>> 	return 0;
>> }
>>
>> diff --git a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
>> b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
>> index 2d9cae8cd984..6ff47aac570a 100644
>> --- a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
>> +++ b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
>> @@ -87,7 +87,7 @@ static struct ttm_pool *ttm_pool_pre_populated(struct
>> kunit *test,
>> 	pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
>> 	KUNIT_ASSERT_NOT_NULL(test, pool);
>>
>> -	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
>> +	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false, false);
> Is this really an acceptable interface (three non-named Booleans in a row)?
>
> I have no idea what "true, false, false" means.
>
> Would having a "flags" parameter and then
>
> USE_DMA_ALLOC 	BIT(0)
> USE_DMA32		BIT(1)
> USE_STRICT_NUMA	BIT(2)
>
> Make this more readable?


I agree, I just followed the existing convention. If this patch gets 
moving, I could send a follow-up patch to cleanup and document a little bit.

>
> Or define your Booleans:
>
> USE_DMA_ALLOC 	true
> USE_DMA32		true
> USE_STRICT_NUMA	true
>
> NO_DMA_ALLOC	false
> NO_DMA32		false
> NO_STRICT_NUMA	false
>
> So at a minimum, we might know what these parameters are?
>
> What is the relationship between this feature and the nid value?
>
> Is this value used for the allocations?
>
> If this is not NUMA_NO_NODE, would this do the same thing?
> (or is the STRICT flag the only way?)
>
> Just some thoughts,
>
> Mike
>
>> 	err = ttm_pool_alloc(pool, tt, &simple_ctx);
>> 	KUNIT_ASSERT_EQ(test, err, 0);
>> @@ -152,7 +152,7 @@ static void ttm_pool_alloc_basic(struct kunit *test)
>> 	KUNIT_ASSERT_NOT_NULL(test, pool);
>>
>> 	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, params-
>>> use_dma_alloc,
>> -		      false);
>> +		      false, false);
>>
>> 	KUNIT_ASSERT_PTR_EQ(test, pool->dev, devs->dev);
>> 	KUNIT_ASSERT_EQ(test, pool->nid, NUMA_NO_NODE);
>> @@ -219,7 +219,7 @@ static void ttm_pool_alloc_basic_dma_addr(struct
>> kunit *test)
>> 	pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
>> 	KUNIT_ASSERT_NOT_NULL(test, pool);
>>
>> -	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
>> +	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false, false);
>>
>> 	err = ttm_pool_alloc(pool, tt, &simple_ctx);
>> 	KUNIT_ASSERT_EQ(test, err, 0);
>> @@ -349,7 +349,7 @@ static void ttm_pool_free_dma_alloc(struct kunit
>> *test)
>> 	pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
>> 	KUNIT_ASSERT_NOT_NULL(test, pool);
>>
>> -	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
>> +	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false, false);
>> 	ttm_pool_alloc(pool, tt, &simple_ctx);
>>
>> 	pt = &pool->caching[caching].orders[order];
>> @@ -380,7 +380,7 @@ static void ttm_pool_free_no_dma_alloc(struct kunit
>> *test)
>> 	pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
>> 	KUNIT_ASSERT_NOT_NULL(test, pool);
>>
>> -	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, false, false);
>> +	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, false, false, false);
>> 	ttm_pool_alloc(pool, tt, &simple_ctx);
>>
>> 	pt = &pool->caching[caching].orders[order];
>> diff --git a/drivers/gpu/drm/ttm/ttm_device.c
>> b/drivers/gpu/drm/ttm/ttm_device.c
>> index f5187b384ae9..540e8a44f015 100644
>> --- a/drivers/gpu/drm/ttm/ttm_device.c
>> +++ b/drivers/gpu/drm/ttm/ttm_device.c
>> @@ -215,7 +215,7 @@ int ttm_device_init(struct ttm_device *bdev, const
>> struct ttm_device_funcs *func
>>
>> 	ttm_sys_man_init(bdev);
>>
>> -	ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc,
>> use_dma32);
>> +	ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc,
>> use_dma32, false);
>>
>> 	bdev->vma_manager = vma_manager;
>> 	spin_lock_init(&bdev->lru_lock);
>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c
>> b/drivers/gpu/drm/ttm/ttm_pool.c
>> index dbc96984d331..73aafd06c361 100644
>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
>> @@ -447,6 +447,9 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct
>> ttm_tt *tt,
>> 	else
>> 		gfp_flags |= GFP_HIGHUSER;
>>
>> +	if (pool->use_strict_numa_alloc)
>> +		gfp_flags |= __GFP_THISNODE;
>> +
>> 	for (order = min_t(unsigned int, MAX_ORDER, __fls(num_pages));
>> 	     num_pages;
>> 	     order = min_t(unsigned int, order, __fls(num_pages))) {
>> @@ -555,7 +558,8 @@ EXPORT_SYMBOL(ttm_pool_free);
>>   * Initialize the pool and its pool types.
>>   */
>> void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>> -		   int nid, bool use_dma_alloc, bool use_dma32)
>> +		   int nid, bool use_dma_alloc, bool use_dma32,
>> +		   bool use_strict_numa_alloc)
>> {
>> 	unsigned int i, j;
>>
>> @@ -565,6 +569,7 @@ void ttm_pool_init(struct ttm_pool *pool, struct
>> device *dev,
>> 	pool->nid = nid;
>> 	pool->use_dma_alloc = use_dma_alloc;
>> 	pool->use_dma32 = use_dma32;
>> +	pool->use_strict_numa_alloc = use_strict_numa_alloc;
>>
>> 	if (use_dma_alloc || nid != NUMA_NO_NODE) {
>> 		for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
>> diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
>> index 30a347e5aa11..6b7bdc952466 100644
>> --- a/include/drm/ttm/ttm_pool.h
>> +++ b/include/drm/ttm/ttm_pool.h
>> @@ -72,6 +72,7 @@ struct ttm_pool {
>>
>> 	bool use_dma_alloc;
>> 	bool use_dma32;
>> +	bool use_strict_numa_alloc;
>>
>> 	struct {
>> 		struct ttm_pool_type orders[MAX_ORDER + 1];
>> @@ -83,7 +84,8 @@ int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt
>> *tt,
>> void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt);
>>
>> void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>> -		   int nid, bool use_dma_alloc, bool use_dma32);
>> +		   int nid, bool use_dma_alloc, bool use_dma32,
>> +		   bool use_strict_numa_alloc);
>> void ttm_pool_fini(struct ttm_pool *pool);
>>
>> int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m);
>> --
>> 2.34.1
diff mbox series

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 9c62552bec34..96532cfc6230 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -253,6 +253,7 @@  extern int amdgpu_user_partt_mode;
 extern int amdgpu_agp;
 
 extern int amdgpu_wbrf;
+extern bool strict_numa_alloc;
 
 #define AMDGPU_VM_MAX_NUM_CTX			4096
 #define AMDGPU_SG_THRESHOLD			(256*1024*1024)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 80b9642f2bc4..a183a6b4493d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -781,6 +781,14 @@  int queue_preemption_timeout_ms = 9000;
 module_param(queue_preemption_timeout_ms, int, 0644);
 MODULE_PARM_DESC(queue_preemption_timeout_ms, "queue preemption timeout in ms (1 = Minimum, 9000 = default)");
 
+/**
+ * DOC: strict_numa_alloc(bool)
+ * Policy to force NUMA allocation requests from the proximity NUMA domain only.
+ */
+bool strict_numa_alloc;
+module_param(strict_numa_alloc, bool, 0444);
+MODULE_PARM_DESC(strict_numa_alloc, "Force NUMA allocation requests to be satisfied from the closest node only (false = default)");
+
 /**
  * DOC: debug_evictions(bool)
  * Enable extra debug messages to help determine the cause of evictions
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index b0ed10f4de60..a9f78f85e28c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1768,6 +1768,7 @@  static int amdgpu_ttm_reserve_tmr(struct amdgpu_device *adev)
 
 static int amdgpu_ttm_pools_init(struct amdgpu_device *adev)
 {
+	bool policy = true;
 	int i;
 
 	if (!adev->gmc.is_app_apu || !adev->gmc.num_mem_partitions)
@@ -1779,11 +1780,15 @@  static int amdgpu_ttm_pools_init(struct amdgpu_device *adev)
 	if (!adev->mman.ttm_pools)
 		return -ENOMEM;
 
+	/* Policy not only depends on the module param but also on the ASIC
+	 * setting use_strict_numa_alloc as well.
+	 */
 	for (i = 0; i < adev->gmc.num_mem_partitions; i++) {
 		ttm_pool_init(&adev->mman.ttm_pools[i], adev->dev,
 			      adev->gmc.mem_partitions[i].numa.node,
-			      false, false);
+			      false, false, policy && strict_numa_alloc);
 	}
+
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
index 2d9cae8cd984..6ff47aac570a 100644
--- a/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
+++ b/drivers/gpu/drm/ttm/tests/ttm_pool_test.c
@@ -87,7 +87,7 @@  static struct ttm_pool *ttm_pool_pre_populated(struct kunit *test,
 	pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, pool);
 
-	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
+	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false, false);
 
 	err = ttm_pool_alloc(pool, tt, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
@@ -152,7 +152,7 @@  static void ttm_pool_alloc_basic(struct kunit *test)
 	KUNIT_ASSERT_NOT_NULL(test, pool);
 
 	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, params->use_dma_alloc,
-		      false);
+		      false, false);
 
 	KUNIT_ASSERT_PTR_EQ(test, pool->dev, devs->dev);
 	KUNIT_ASSERT_EQ(test, pool->nid, NUMA_NO_NODE);
@@ -219,7 +219,7 @@  static void ttm_pool_alloc_basic_dma_addr(struct kunit *test)
 	pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, pool);
 
-	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
+	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false, false);
 
 	err = ttm_pool_alloc(pool, tt, &simple_ctx);
 	KUNIT_ASSERT_EQ(test, err, 0);
@@ -349,7 +349,7 @@  static void ttm_pool_free_dma_alloc(struct kunit *test)
 	pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, pool);
 
-	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false);
+	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, true, false, false);
 	ttm_pool_alloc(pool, tt, &simple_ctx);
 
 	pt = &pool->caching[caching].orders[order];
@@ -380,7 +380,7 @@  static void ttm_pool_free_no_dma_alloc(struct kunit *test)
 	pool = kunit_kzalloc(test, sizeof(*pool), GFP_KERNEL);
 	KUNIT_ASSERT_NOT_NULL(test, pool);
 
-	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, false, false);
+	ttm_pool_init(pool, devs->dev, NUMA_NO_NODE, false, false, false);
 	ttm_pool_alloc(pool, tt, &simple_ctx);
 
 	pt = &pool->caching[caching].orders[order];
diff --git a/drivers/gpu/drm/ttm/ttm_device.c b/drivers/gpu/drm/ttm/ttm_device.c
index f5187b384ae9..540e8a44f015 100644
--- a/drivers/gpu/drm/ttm/ttm_device.c
+++ b/drivers/gpu/drm/ttm/ttm_device.c
@@ -215,7 +215,7 @@  int ttm_device_init(struct ttm_device *bdev, const struct ttm_device_funcs *func
 
 	ttm_sys_man_init(bdev);
 
-	ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, use_dma32);
+	ttm_pool_init(&bdev->pool, dev, dev_to_node(dev), use_dma_alloc, use_dma32, false);
 
 	bdev->vma_manager = vma_manager;
 	spin_lock_init(&bdev->lru_lock);
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index dbc96984d331..73aafd06c361 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -447,6 +447,9 @@  int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 	else
 		gfp_flags |= GFP_HIGHUSER;
 
+	if (pool->use_strict_numa_alloc)
+		gfp_flags |= __GFP_THISNODE;
+
 	for (order = min_t(unsigned int, MAX_ORDER, __fls(num_pages));
 	     num_pages;
 	     order = min_t(unsigned int, order, __fls(num_pages))) {
@@ -555,7 +558,8 @@  EXPORT_SYMBOL(ttm_pool_free);
  * Initialize the pool and its pool types.
  */
 void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
-		   int nid, bool use_dma_alloc, bool use_dma32)
+		   int nid, bool use_dma_alloc, bool use_dma32,
+		   bool use_strict_numa_alloc)
 {
 	unsigned int i, j;
 
@@ -565,6 +569,7 @@  void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
 	pool->nid = nid;
 	pool->use_dma_alloc = use_dma_alloc;
 	pool->use_dma32 = use_dma32;
+	pool->use_strict_numa_alloc = use_strict_numa_alloc;
 
 	if (use_dma_alloc || nid != NUMA_NO_NODE) {
 		for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
diff --git a/include/drm/ttm/ttm_pool.h b/include/drm/ttm/ttm_pool.h
index 30a347e5aa11..6b7bdc952466 100644
--- a/include/drm/ttm/ttm_pool.h
+++ b/include/drm/ttm/ttm_pool.h
@@ -72,6 +72,7 @@  struct ttm_pool {
 
 	bool use_dma_alloc;
 	bool use_dma32;
+	bool use_strict_numa_alloc;
 
 	struct {
 		struct ttm_pool_type orders[MAX_ORDER + 1];
@@ -83,7 +84,8 @@  int ttm_pool_alloc(struct ttm_pool *pool, struct ttm_tt *tt,
 void ttm_pool_free(struct ttm_pool *pool, struct ttm_tt *tt);
 
 void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
-		   int nid, bool use_dma_alloc, bool use_dma32);
+		   int nid, bool use_dma_alloc, bool use_dma32,
+		   bool use_strict_numa_alloc);
 void ttm_pool_fini(struct ttm_pool *pool);
 
 int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m);