[v3,2/2] drm/xe/lnl: Offload system clear page activity to GPU

Message ID 20240624141456.23195-2-nirmoy.das@intel.com (mailing list archive)
State New, archived
Series [v3,1/2] drm/ttm: Add a flag to allow drivers to skip clear-on-free

Commit Message

Nirmoy Das June 24, 2024, 2:14 p.m. UTC
On LNL, because of flat CCS, the driver creates a migrate job to clear
CCS metadata. Extend that job to also clear system pages using the GPU.
Inform TTM to allocate pages without __GFP_ZERO to avoid double page
clearing: clear the TTM_TT_FLAG_ZERO_ALLOC flag on populate, and set
TTM_TT_FLAG_CLEARED_ON_FREE on free to skip the TTM pool's
clear-on-free, as Xe now takes care of clearing pages.

To test the patch, I created a small test that submits a job after
binding buffers of various sizes; it shows good gains for larger
buffers. For smaller buffer sizes the results are not very reliable, as
they vary a lot.
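
The timing pattern of the benchmark is roughly the following sketch
(the actual IGT subtest is not reproduced here; submit_store_job() is a
hypothetical stand-in for binding a buffer of the given size and
submitting a store job against it):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in: bind a buffer of the given size, submit a
 * store job against it and wait for completion. Elided here. */
static void submit_store_job(uint64_t size)
{
	(void)size;
}

static void time_one_size(const char *name, uint64_t size)
{
	struct timespec start, end;
	int64_t us;

	clock_gettime(CLOCK_MONOTONIC, &start);
	submit_store_job(size);
	clock_gettime(CLOCK_MONOTONIC, &end);

	us = (end.tv_sec - start.tv_sec) * 1000000LL +
	     (end.tv_nsec - start.tv_nsec) / 1000;
	printf("Time taken for size %s: %lld us\n", name, (long long)us);
}

int main(void)
{
	time_one_size("SZ_4K", 4096ULL);
	time_one_size("SZ_1G", 1024ULL * 1024 * 1024);
	return 0;
}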

Without the patch
sudo ~/igt-gpu-tools/build/tests/xe_exec_store --run basic-store-benchmark
IGT-Version: 1.28-g2ed908c0b (x86_64) (Linux: 6.10.0-rc2-xe+ x86_64)
Using IGT_SRANDOM=1719237905 for randomisation
Opened device: /dev/dri/card0
Starting subtest: basic-store-benchmark
Starting dynamic subtest: WC
Dynamic subtest WC: SUCCESS (0.000s)
Time taken for size SZ_4K: 9493 us
Time taken for size SZ_2M: 5503 us
Time taken for size SZ_64M: 13016 us
Time taken for size SZ_128M: 29464 us
Time taken for size SZ_256M: 38408 us
Time taken for size SZ_1G: 148758 us
Starting dynamic subtest: WB
Dynamic subtest WB: SUCCESS (0.000s)
Time taken for size SZ_4K: 3889 us
Time taken for size SZ_2M: 6091 us
Time taken for size SZ_64M: 20920 us
Time taken for size SZ_128M: 32394 us
Time taken for size SZ_256M: 61710 us
Time taken for size SZ_1G: 215437 us
Subtest basic-store-benchmark: SUCCESS (0.589s)

With the patch:
sudo ~/igt-gpu-tools/build/tests/xe_exec_store --run basic-store-benchmark
IGT-Version: 1.28-g2ed908c0b (x86_64) (Linux: 6.10.0-rc2-xe+ x86_64)
Using IGT_SRANDOM=1719238062 for randomisation
Opened device: /dev/dri/card0
Starting subtest: basic-store-benchmark
Starting dynamic subtest: WC
Dynamic subtest WC: SUCCESS (0.000s)
Time taken for size SZ_4K: 11803 us
Time taken for size SZ_2M: 4237 us
Time taken for size SZ_64M: 8649 us
Time taken for size SZ_128M: 14682 us
Time taken for size SZ_256M: 22156 us
Time taken for size SZ_1G: 74457 us
Starting dynamic subtest: WB
Dynamic subtest WB: SUCCESS (0.000s)
Time taken for size SZ_4K: 5129 us
Time taken for size SZ_2M: 12563 us
Time taken for size SZ_64M: 14860 us
Time taken for size SZ_128M: 26064 us
Time taken for size SZ_256M: 47167 us
Time taken for size SZ_1G: 170304 us
Subtest basic-store-benchmark: SUCCESS (0.417s)

With the patch and init_on_alloc=0
sudo ~/igt-gpu-tools/build/tests/xe_exec_store --run basic-store-benchmark
IGT-Version: 1.28-g2ed908c0b (x86_64) (Linux: 6.10.0-rc2-xe+ x86_64)
Using IGT_SRANDOM=1719238219 for randomisation
Opened device: /dev/dri/card0
Starting subtest: basic-store-benchmark
Starting dynamic subtest: WC
Dynamic subtest WC: SUCCESS (0.000s)
Time taken for size SZ_4K: 4803 us
Time taken for size SZ_2M: 9212 us
Time taken for size SZ_64M: 9643 us
Time taken for size SZ_128M: 13479 us
Time taken for size SZ_256M: 22429 us
Time taken for size SZ_1G: 83110 us
Starting dynamic subtest: WB
Dynamic subtest WB: SUCCESS (0.000s)
Time taken for size SZ_4K: 4003 us
Time taken for size SZ_2M: 4443 us
Time taken for size SZ_64M: 12960 us
Time taken for size SZ_128M: 13741 us
Time taken for size SZ_256M: 26841 us
Time taken for size SZ_1G: 84746 us
Subtest basic-store-benchmark: SUCCESS (0.290s)

v2: Handle regression on dgfx (Himal)
    Update commit message as no TTM API changes are needed.
v3: Fix KUnit test.

Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
---
 drivers/gpu/drm/xe/xe_bo.c           | 11 +++++++++++
 drivers/gpu/drm/xe/xe_device.c       |  7 +++++++
 drivers/gpu/drm/xe/xe_device_types.h |  2 ++
 drivers/gpu/drm/xe/xe_migrate.c      |  5 +++--
 4 files changed, 23 insertions(+), 2 deletions(-)

Comments

Thomas Hellstrom June 27, 2024, 7:55 a.m. UTC | #1
Hi Nirmoy

On Mon, 2024-06-24 at 16:14 +0200, Nirmoy Das wrote:
> On LNL, because of flat CCS, the driver creates a migrate job to clear
> CCS metadata. Extend that job to also clear system pages using the GPU.
> Inform TTM to allocate pages without __GFP_ZERO to avoid double page
> clearing: clear the TTM_TT_FLAG_ZERO_ALLOC flag on populate, and set
> TTM_TT_FLAG_CLEARED_ON_FREE on free to skip the TTM pool's
> clear-on-free, as Xe now takes care of clearing pages.
> 
> To test the patch, I created a small test that submits a job after
> binding buffers of various sizes; it shows good gains for larger
> buffers. For smaller buffer sizes the results are not very reliable, as
> they vary a lot.

Some concerns below, and also a big security concern.

The CCS clearing occurs when the bo is moved to TT. But there are
situations in which the bo is created and populated in system memory.
For example, if the bo is created using
DRM_XE_GEM_CREATE_FLAG_DEFER_BACKING and then mmap'd, it won't get
cleared. Since we don't have a dma-mapping of the bo at that time, we
must revert to a CPU clear when that happens.
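
A minimal sketch of what that CPU fallback could look like
(illustrative only, not part of this patch; deciding *when* to call it,
e.g. from a populate-for-CPU path, is the open question here):

/* Hedged sketch: clear a tt's pages on the CPU when they were
 * allocated without __GFP_ZERO and no GPU clear job has run yet. */
#include <linux/highmem.h>
#include <drm/ttm/ttm_tt.h>

static void xe_tt_clear_cpu(struct ttm_tt *tt)
{
	u32 i;

	for (i = 0; i < tt->num_pages; i++)
		clear_highpage(tt->pages[i]);
}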

> [snip benchmark results, changelog and diffstat]
> 
> [snip xe_bo.c, xe_device.c and xe_device_types.h hunks]
> 
> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> index 05f933787860..61bf3d80e896 100644
> --- a/drivers/gpu/drm/xe/xe_migrate.c
> +++ b/drivers/gpu/drm/xe/xe_migrate.c
> @@ -1003,6 +1003,7 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
>  	struct xe_gt *gt = m->tile->primary_gt;
>  	struct xe_device *xe = gt_to_xe(gt);
>  	bool clear_system_ccs = (xe_bo_needs_ccs_pages(bo) && !IS_DGFX(xe)) ? true : false;
> +	bool clear_on_create = xe->mem.gpu_page_clear || !clear_system_ccs;

Hm. In what situation is clear_on_create false? 

Previously we had clear_system_ccs to clear *only* CCS, but now that
has effectively changed to clear_also_ccs?

I think xe_migrate_clear() should not care whether this is an initial
clearing or a clearing for other reasons, but rather be passed two
flags, "clear_bo_data" and "clear_ccs" or something similar. We should
also avoid variable names that refer to usage by higher layers, of
which the migrate code should have no knowledge.
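
A sketch of the interface shape this suggests (flag names are
illustrative only, not a settled API; BIT() is the usual kernel macro):

#define XE_MIGRATE_CLEAR_FLAG_BO_DATA	BIT(0)
#define XE_MIGRATE_CLEAR_FLAG_CCS_DATA	BIT(1)

/* Keeping the existing parameters, but with explicit clear flags. */
struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
				   struct xe_bo *bo,
				   struct ttm_resource *dst,
				   u32 clear_flags);

Callers that clear only CCS would pass XE_MIGRATE_CLEAR_FLAG_CCS_DATA
alone, while the GPU page-clear path would pass both flags.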

/Thomas


> [snip remainder of xe_migrate.c changes]
Nirmoy Das June 27, 2024, 5:04 p.m. UTC | #2
On 6/27/2024 9:55 AM, Thomas Hellström wrote:
> Hi Nirmoy
>
> On Mon, 2024-06-24 at 16:14 +0200, Nirmoy Das wrote:
>> On LNL, because of flat CCS, the driver creates a migrate job to clear
>> CCS metadata. Extend that job to also clear system pages using the GPU.
>> Inform TTM to allocate pages without __GFP_ZERO to avoid double page
>> clearing: clear the TTM_TT_FLAG_ZERO_ALLOC flag on populate, and set
>> TTM_TT_FLAG_CLEARED_ON_FREE on free to skip the TTM pool's
>> clear-on-free, as Xe now takes care of clearing pages.
>>
>> To test the patch, I created a small test that submits a job after
>> binding buffers of various sizes; it shows good gains for larger
>> buffers. For smaller buffer sizes the results are not very reliable, as
>> they vary a lot.
> Some concerns below, and also a big security concern.
>
> The CCS clearing occurs when the bo is moved to TT. But there are
> situations in which the bo is created and populated in system memory.
> For example, if the bo is created using
> DRM_XE_GEM_CREATE_FLAG_DEFER_BACKING and then mmap'd, it won't get
> cleared. Since we don't have a dma-mapping of the bo at that time, we
> must revert to a CPU clear when that happens.

Thanks for pointing this out. I completely missed the fact that a BO
can be accessed by the CPU only.


>
>> [snip benchmark results, changelog and diffstat]
>>
>> [snip xe_bo.c, xe_device.c and xe_device_types.h hunks]
>>
>> diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
>> index 05f933787860..61bf3d80e896 100644
>> --- a/drivers/gpu/drm/xe/xe_migrate.c
>> +++ b/drivers/gpu/drm/xe/xe_migrate.c
>> @@ -1003,6 +1003,7 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
>>   	struct xe_gt *gt = m->tile->primary_gt;
>>   	struct xe_device *xe = gt_to_xe(gt);
>>   	bool clear_system_ccs = (xe_bo_needs_ccs_pages(bo) && !IS_DGFX(xe)) ? true : false;
>> +	bool clear_on_create = xe->mem.gpu_page_clear || !clear_system_ccs;
> Hm. In what situation is clear_on_create false?
On iGFX < Xe2.
> Previously we had clear_system_ccs to clear *only* CCS, but now that
> has effectively changed to clear_also_ccs?
>
> I think xe_migrate_clear() should not care whether this is an initial
> clearing or a clearing for other reasons, but rather be passed two
> flags, "clear_bo_data" and "clear_ccs" or something similar. We should
> also avoid variable names that refer to usage by higher layers, of
> which the migrate code should have no knowledge.

v3 was to fix the KUnit test, which should be fixed by the two-flag
parameters you are suggesting.

Thanks for looking into this. I will resend with the above suggestions
after some more testing.


Thanks,

Nirmoy

Patch

diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
index 65c696966e96..a9ce4347a7d7 100644
--- a/drivers/gpu/drm/xe/xe_bo.c
+++ b/drivers/gpu/drm/xe/xe_bo.c
@@ -399,6 +399,7 @@ static struct ttm_tt *xe_ttm_tt_create(struct ttm_buffer_object *ttm_bo,
 static int xe_ttm_tt_populate(struct ttm_device *ttm_dev, struct ttm_tt *tt,
 			      struct ttm_operation_ctx *ctx)
 {
+	struct xe_device *xe = ttm_to_xe_device(ttm_dev);
 	int err;
 
 	/*
@@ -408,6 +409,10 @@ static int xe_ttm_tt_populate(struct ttm_device *ttm_dev, struct ttm_tt *tt,
 	if (tt->page_flags & TTM_TT_FLAG_EXTERNAL)
 		return 0;
 
+	/* Clear TTM_TT_FLAG_ZERO_ALLOC when GPU is set to clear pages */
+	if (xe->mem.gpu_page_clear)
+		tt->page_flags &= ~TTM_TT_FLAG_ZERO_ALLOC;
+
 	err = ttm_pool_alloc(&ttm_dev->pool, tt, ctx);
 	if (err)
 		return err;
@@ -417,9 +422,15 @@ static int xe_ttm_tt_populate(struct ttm_device *ttm_dev, struct ttm_tt *tt,
 
 static void xe_ttm_tt_unpopulate(struct ttm_device *ttm_dev, struct ttm_tt *tt)
 {
+	struct xe_device *xe = ttm_to_xe_device(ttm_dev);
+
 	if (tt->page_flags & TTM_TT_FLAG_EXTERNAL)
 		return;
 
+	/* Add TTM_TT_FLAG_CLEARED_ON_FREE when GPU is set to clear pages */
+	if (xe->mem.gpu_page_clear)
+		tt->page_flags |= TTM_TT_FLAG_CLEARED_ON_FREE;
+
 	xe_tt_unmap_sg(tt);
 
 	return ttm_pool_free(&ttm_dev->pool, tt);
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index ca5e8435485a..c9afff1d0734 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -636,6 +636,13 @@ int xe_device_probe(struct xe_device *xe)
 	if (err)
 		goto err_irq_shutdown;
 
+	/**
+	 * On iGFX device with flat CCS, we clear CCS metadata, let's extend that
+	 * and use GPU to clear pages as well.
+	 */
+	if (xe_device_has_flat_ccs(xe) && !IS_DGFX(xe))
+		xe->mem.gpu_page_clear = true;
+
 	err = xe_vram_probe(xe);
 	if (err)
 		goto err_irq_shutdown;
diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
index c37be471d11c..ece68c6f3668 100644
--- a/drivers/gpu/drm/xe/xe_device_types.h
+++ b/drivers/gpu/drm/xe/xe_device_types.h
@@ -325,6 +325,8 @@ struct xe_device {
 		struct xe_mem_region vram;
 		/** @mem.sys_mgr: system TTM manager */
 		struct ttm_resource_manager sys_mgr;
+		/** @gpu_page_clear: clear pages offloaded to GPU */
+		bool gpu_page_clear;
 	} mem;
 
 	/** @sriov: device level virtualization data */
diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
index 05f933787860..61bf3d80e896 100644
--- a/drivers/gpu/drm/xe/xe_migrate.c
+++ b/drivers/gpu/drm/xe/xe_migrate.c
@@ -1003,6 +1003,7 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
 	struct xe_gt *gt = m->tile->primary_gt;
 	struct xe_device *xe = gt_to_xe(gt);
 	bool clear_system_ccs = (xe_bo_needs_ccs_pages(bo) && !IS_DGFX(xe)) ? true : false;
+	bool clear_on_create = xe->mem.gpu_page_clear || !clear_system_ccs;
 	struct dma_fence *fence = NULL;
 	u64 size = bo->size;
 	struct xe_res_cursor src_it;
@@ -1032,7 +1033,7 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
 		batch_size = 2 +
 			pte_update_size(m, clear_vram, src, &src_it,
 					&clear_L0, &clear_L0_ofs, &clear_L0_pt,
-					clear_system_ccs ? 0 : emit_clear_cmd_len(gt), 0,
+					!clear_on_create ? 0 : emit_clear_cmd_len(gt), 0,
 					avail_pts);
 
 		if (xe_device_has_flat_ccs(xe))
@@ -1060,7 +1061,7 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
 		bb->cs[bb->len++] = MI_BATCH_BUFFER_END;
 		update_idx = bb->len;
 
-		if (!clear_system_ccs)
+		if (clear_on_create)
 			emit_clear(gt, bb, clear_L0_ofs, clear_L0, XE_PAGE_SIZE, clear_vram);
 
 		if (xe_device_has_flat_ccs(xe)) {