Message ID | 20241018133308.889-5-Yunxiang.Li@amd.com (mailing list archive)
---|---
State | New, archived
Series | rework bo mem stats tracking
Am 18.10.24 um 15:33 schrieb Yunxiang Li:
> Before, every time fdinfo is queried we try to lock all the BOs in the
> VM and calculate memory usage from scratch. This works okay if the
> fdinfo is rarely read and the VMs don't have a ton of BOs. If either of
> these conditions is not true, we get a massive performance hit.
>
> In this new revision, we track the BOs as they change states. This way
> when the fdinfo is queried we only need to take the status lock and copy
> out the usage stats with minimal impact to the runtime performance.
>
> Signed-off-by: Yunxiang Li <Yunxiang.Li@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c |  14 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c  |  11 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.c  |  82 +-------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_object.h  |   3 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c      | 204 ++++++++++++++++++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h      |  13 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c   |   1 +
>  drivers/gpu/drm/drm_file.c                  |   8 +
>  include/drm/drm_file.h                      |   1 +
>  9 files changed, 220 insertions(+), 117 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
> index b144404902255..1d8a0ff3c8604 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
> @@ -36,6 +36,7 @@
>  #include "amdgpu_gem.h"
>  #include "amdgpu_dma_buf.h"
>  #include "amdgpu_xgmi.h"
> +#include "amdgpu_vm.h"
>  #include <drm/amdgpu_drm.h>
>  #include <drm/ttm/ttm_tt.h>
>  #include <linux/dma-buf.h>
> @@ -190,6 +191,13 @@ static void amdgpu_dma_buf_unmap(struct dma_buf_attachment *attach,
>  	}
>  }
>
> +static void amdgpu_dma_buf_release(struct dma_buf *buf)
> +{
> +	struct amdgpu_bo *bo = gem_to_amdgpu_bo(buf->priv);
> +	amdgpu_vm_bo_update_shared(bo, -1);
> +	drm_gem_dmabuf_release(buf);
> +}
> +
>  /**
>   * amdgpu_dma_buf_begin_cpu_access - &dma_buf_ops.begin_cpu_access implementation
>   * @dma_buf: Shared DMA buffer
> @@ -237,7 +245,7 @@ const struct dma_buf_ops amdgpu_dmabuf_ops = {
>  	.unpin = amdgpu_dma_buf_unpin,
>  	.map_dma_buf = amdgpu_dma_buf_map,
>  	.unmap_dma_buf = amdgpu_dma_buf_unmap,
> -	.release = drm_gem_dmabuf_release,
> +	.release = amdgpu_dma_buf_release,
>  	.begin_cpu_access = amdgpu_dma_buf_begin_cpu_access,
>  	.mmap = drm_gem_dmabuf_mmap,
>  	.vmap = drm_gem_dmabuf_vmap,
> @@ -265,8 +273,10 @@ struct dma_buf *amdgpu_gem_prime_export(struct drm_gem_object *gobj,
>  		return ERR_PTR(-EPERM);
>
>  	buf = drm_gem_prime_export(gobj, flags);
> -	if (!IS_ERR(buf))
> +	if (!IS_ERR(buf)) {
>  		buf->ops = &amdgpu_dmabuf_ops;
> +		amdgpu_vm_bo_update_shared(bo, +1);
> +	}
>
>  	return buf;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
> index 7a9573958d87c..ceedfc3665c18 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
> @@ -40,6 +40,7 @@
>  #include "amdgpu_gem.h"
>  #include "amdgpu_ctx.h"
>  #include "amdgpu_fdinfo.h"
> +#include "amdgpu_ttm.h"
>
>
>  static const char *amdgpu_ip_name[AMDGPU_HW_IP_NUM] = {
> @@ -60,7 +61,7 @@ void amdgpu_show_fdinfo(struct drm_printer *p, struct drm_file *file)
>  	struct amdgpu_fpriv *fpriv = file->driver_priv;
>  	struct amdgpu_vm *vm = &fpriv->vm;
>
> -	struct amdgpu_mem_stats stats[__AMDGPU_PL_LAST + 1] = { };
> +	struct amdgpu_mem_stats stats[__AMDGPU_PL_LAST] = { };
>  	ktime_t usage[AMDGPU_HW_IP_NUM];
>  	const char *pl_name[] = {
>  		[TTM_PL_VRAM] = "vram",
> @@ -70,13 +71,7 @@ void amdgpu_show_fdinfo(struct drm_printer *p, struct drm_file *file)
>  	unsigned int hw_ip, i;
>  	int ret;
>
> -	ret = amdgpu_bo_reserve(vm->root.bo, false);
> -	if (ret)
> -		return;
> -
> -	amdgpu_vm_get_memory(vm, stats, ARRAY_SIZE(stats));
> -	amdgpu_bo_unreserve(vm->root.bo);
> -
> +	amdgpu_vm_get_memory(vm, stats);
>  	amdgpu_ctx_mgr_usage(&fpriv->ctx_mgr, usage);
>
>  	/*
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> index 2436b7c9ad12b..5ff147881da6d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
> @@ -1156,7 +1156,7 @@ void amdgpu_bo_move_notify(struct ttm_buffer_object *bo,
>  		return;
>
>  	abo = ttm_to_amdgpu_bo(bo);
> -	amdgpu_vm_bo_invalidate(abo, evict);
> +	amdgpu_vm_bo_move(abo, new_mem, evict);
>
>  	amdgpu_bo_kunmap(abo);
>
> @@ -1169,86 +1169,6 @@ void amdgpu_bo_move_notify(struct ttm_buffer_object *bo,
>  			     old_mem ? old_mem->mem_type : -1);
>  }
>
> -void amdgpu_bo_get_memory(struct amdgpu_bo *bo,
> -			  struct amdgpu_mem_stats *stats,
> -			  unsigned int sz)
> -{
> -	const unsigned int domain_to_pl[] = {
> -		[ilog2(AMDGPU_GEM_DOMAIN_CPU)] = TTM_PL_SYSTEM,
> -		[ilog2(AMDGPU_GEM_DOMAIN_GTT)] = TTM_PL_TT,
> -		[ilog2(AMDGPU_GEM_DOMAIN_VRAM)] = TTM_PL_VRAM,
> -		[ilog2(AMDGPU_GEM_DOMAIN_GDS)] = AMDGPU_PL_GDS,
> -		[ilog2(AMDGPU_GEM_DOMAIN_GWS)] = AMDGPU_PL_GWS,
> -		[ilog2(AMDGPU_GEM_DOMAIN_OA)] = AMDGPU_PL_OA,
> -		[ilog2(AMDGPU_GEM_DOMAIN_DOORBELL)] = AMDGPU_PL_DOORBELL,
> -	};
> -	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
> -	struct ttm_resource *res = bo->tbo.resource;
> -	struct drm_gem_object *obj = &bo->tbo.base;
> -	uint64_t size = amdgpu_bo_size(bo);
> -	unsigned int type;
> -
> -	if (!res) {
> -		/*
> -		 * If no backing store use one of the preferred domain for basic
> -		 * stats. We take the MSB since that should give a reasonable
> -		 * view.
> -		 */
> -		BUILD_BUG_ON(TTM_PL_VRAM < TTM_PL_TT ||
> -			     TTM_PL_VRAM < TTM_PL_SYSTEM);
> -		type = fls(bo->preferred_domains & AMDGPU_GEM_DOMAIN_MASK);
> -		if (!type)
> -			return;
> -		type--;
> -		if (drm_WARN_ON_ONCE(&adev->ddev,
> -				     type >= ARRAY_SIZE(domain_to_pl)))
> -			return;
> -		type = domain_to_pl[type];
> -	} else {
> -		type = res->mem_type;
> -	}
> -
> -	/* Squash some into 'cpu' to keep the legacy userspace view. */
> -	switch (type) {
> -	case TTM_PL_VRAM:
> -	case TTM_PL_TT:
> -	case TTM_PL_SYSTEM:
> -		break;
> -	default:
> -		type = TTM_PL_SYSTEM;
> -		break;
> -	}
> -
> -	if (drm_WARN_ON_ONCE(&adev->ddev, type >= sz))
> -		return;
> -
> -	/* DRM stats common fields: */
> -
> -	if (drm_gem_object_is_shared_for_memory_stats(obj))
> -		stats[type].drm.shared += size;
> -	else
> -		stats[type].drm.private += size;
> -
> -	if (res) {
> -		stats[type].drm.resident += size;
> -
> -		if (!dma_resv_test_signaled(obj->resv, DMA_RESV_USAGE_BOOKKEEP))
> -			stats[type].drm.active += size;
> -		else if (bo->flags & AMDGPU_GEM_CREATE_DISCARDABLE)
> -			stats[type].drm.purgeable += size;
> -	}
> -
> -	/* amdgpu specific stats: */
> -
> -	if (bo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM) {
> -		stats[TTM_PL_VRAM].requested += size;
> -		if (type != TTM_PL_VRAM)
> -			stats[TTM_PL_VRAM].evicted += size;
> -	} else if (bo->preferred_domains & AMDGPU_GEM_DOMAIN_GTT) {
> -		stats[TTM_PL_TT].requested += size;
> -	}
> -}
> -
>  /**
>   * amdgpu_bo_release_notify - notification about a BO being released
>   * @bo: pointer to a buffer object
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> index be6769852ece4..ebad4f96775d9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> @@ -300,9 +300,6 @@ int amdgpu_bo_sync_wait_resv(struct amdgpu_device *adev, struct dma_resv *resv,
>  int amdgpu_bo_sync_wait(struct amdgpu_bo *bo, void *owner, bool intr);
>  u64 amdgpu_bo_gpu_offset(struct amdgpu_bo *bo);
>  u64 amdgpu_bo_gpu_offset_no_check(struct amdgpu_bo *bo);
> -void amdgpu_bo_get_memory(struct amdgpu_bo *bo,
> -			  struct amdgpu_mem_stats *stats,
> -			  unsigned int size);
>  uint32_t amdgpu_bo_get_preferred_domain(struct amdgpu_device *adev,
>  					uint32_t domain);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> index 9fab64edd0530..a802cea67a4d7 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
> @@ -36,6 +36,7 @@
>  #include <drm/ttm/ttm_tt.h>
>  #include <drm/drm_exec.h>
>  #include "amdgpu.h"
> +#include "amdgpu_vm.h"
>  #include "amdgpu_trace.h"
>  #include "amdgpu_amdkfd.h"
>  #include "amdgpu_gmc.h"
> @@ -310,6 +311,134 @@ static void amdgpu_vm_bo_reset_state_machine(struct amdgpu_vm *vm)
>  	spin_unlock(&vm->status_lock);
>  }
>
> +static uint32_t fold_memtype(uint32_t memtype) {

In general please add prefixes to even static functions, e.g. amdgpu_vm_ or amdgpu_bo_.

> +	/* Squash private placements into 'cpu' to keep the legacy userspace view. */
> +	switch (mem_type) {
> +	case TTM_PL_VRAM:
> +	case TTM_PL_TT:
> +		return memtype
> +	default:
> +		return TTM_PL_SYSTEM;
> +	}
> +}
> +
> +static uint32_t bo_get_memtype(struct amdgpu_bo *bo) {

That whole function belongs in amdgpu_bo.c

> +	struct ttm_resource *res = bo->tbo.resource;
> +	const uint32_t domain_to_pl[] = {
> +		[ilog2(AMDGPU_GEM_DOMAIN_CPU)] = TTM_PL_SYSTEM,
> +		[ilog2(AMDGPU_GEM_DOMAIN_GTT)] = TTM_PL_TT,
> +		[ilog2(AMDGPU_GEM_DOMAIN_VRAM)] = TTM_PL_VRAM,
> +		[ilog2(AMDGPU_GEM_DOMAIN_GDS)] = AMDGPU_PL_GDS,
> +		[ilog2(AMDGPU_GEM_DOMAIN_GWS)] = AMDGPU_PL_GWS,
> +		[ilog2(AMDGPU_GEM_DOMAIN_OA)] = AMDGPU_PL_OA,
> +		[ilog2(AMDGPU_GEM_DOMAIN_DOORBELL)] = AMDGPU_PL_DOORBELL,
> +	};
> +	uint32_t domain;
> +
> +	if (res)
> +		return fold_memtype(res->mem_type);
> +
> +	/*
> +	 * If no backing store use one of the preferred domain for basic
> +	 * stats. We take the MSB since that should give a reasonable
> +	 * view.
> +	 */
> +	BUILD_BUG_ON(TTM_PL_VRAM < TTM_PL_TT || TTM_PL_VRAM < TTM_PL_SYSTEM);
> +	domain = fls(bo->preferred_domains & AMDGPU_GEM_DOMAIN_MASK);
> +	if (drm_WARN_ON_ONCE(&adev->ddev,
> +			     domain == 0 || --domain >= ARRAY_SIZE(domain_to_pl)))

It's perfectly legal to create a BO without a placement. That one just won't have a backing store.

> +		return 0;
> +	return fold_memtype(domain_to_pl[domain])

That would need speculative execution mitigation if I'm not completely mistaken.

Better use a switch/case statement.

> +}
> +
> +/**
> + * amdgpu_vm_update_shared - helper to update shared memory stat
> + * @base: base structure for tracking BO usage in a VM
> + * @sign: if we should add (+1) or subtract (-1) the memory stat
> + *
> + * Takes the vm status_lock and updates the shared memory stat. If the basic
> + * stat changed (e.g. buffer was moved) amdgpu_vm_update_stats need to be called
> + * as well.
> + */
> +static void amdgpu_vm_update_shared(struct amdgpu_vm_bo_base *base, int sign)
> +{
> +	struct amdgpu_vm *vm = base->vm;
> +	struct amdgpu_bo *bo = base->bo;
> +	int64_t size;
> +	int type;
> +
> +	if (!vm || !bo || !(sign == +1 || sign == -1))
> +		return;

Please drop this kind of check.

> +
> +	spin_lock(&vm->status_lock);
> +	size = sign * amdgpu_bo_size(bo);
> +	type = bo_get_memtype(bo);
> +	vm->stats[type].drm.shared += size;
> +	vm->stats[type].drm.private -= size;
> +	spin_unlock(&vm->status_lock);
> +}
> +
> +/**
> + * amdgpu_vm_update_stats - helper to update normal memory stat
> + * @base: base structure for tracking BO usage in a VM
> + * @new_mem: the new placement of the BO if any (e.g. NULL when BO is deleted)
> + * @old_mem: the old placement of the BO if any (e.g. NULL when BO is created)
> + *
> + * Takes the vm status_lock and updates the basic memory stat. If the shared
> + * stat changed (e.g. buffer was exported) amdgpu_vm_update_shared need to be
> + * called as well.
> + */
> +void amdgpu_vm_update_stats(struct amdgpu_vm_bo_base *base,
> +			    struct ttm_resource *new_mem,
> +			    struct ttm_resource *old_mem)
> +{
> +	struct amdgpu_vm *vm = base->vm;
> +	struct amdgpu_bo *bo = base->bo;
> +	uint64_t size;
> +	int type;
> +	bool shared;
> +
> +	if (!vm || !bo || (!new_mem && !old_mem))
> +		return;
> +
> +	spin_lock(&vm->status_lock);
> +
> +	size = amdgpu_bo_size(bo);
> +	shared = drm_gem_object_is_shared_for_memory_stats(&bo->tbo.base);

That should probably be outside of the spinlock.

> +
> +	if (old_mem) {
> +		type = fold_memtype(old_mem->mem_type);
> +		if (shared)
> +			vm->stats[i].drm.shared -= size;
> +		else
> +			vm->stats[i].drm.private -= size;
> +	}
> +	if (new_mem) {
> +		type = fold_memtype(new_mem->mem_type);
> +		if (shared)
> +			vm->stats[i].drm.shared += size;
> +		else
> +			vm->stats[i].drm.private += size;
> +	}
> +	if (bo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM) {
> +		if (!old_mem)
> +			vm->stats[TTM_PL_VRAM].requested += size;
> +		else if (old_mem->mem_type != TTM_PL_VRAM)
> +			vm->stats[TTM_PL_VRAM].evicted -= size;
> +		if (!new_mem)
> +			vm->stats[TTM_PL_VRAM].requested -= size;
> +		else if (new_mem->mem_type != TTM_PL_VRAM)
> +			vm->stats[TTM_PL_VRAM].evicted += size;
> +	} else if (bo->preferred_domains & AMDGPU_GEM_DOMAIN_GTT) {
> +		if (!old_mem)
> +			vm->stats[TTM_PL_TT].requested += size;
> +		if (!new_mem)
> +			vm->stats[TTM_PL_TT].requested -= size;
> +	}
> +
> +	spin_unlock(&vm->status_lock);
> +}
> +
>  /**
>   * amdgpu_vm_bo_base_init - Adds bo to the list of bos associated with the vm
>   *
> @@ -332,6 +461,7 @@ void amdgpu_vm_bo_base_init(struct amdgpu_vm_bo_base *base,
>  		return;
>  	base->next = bo->vm_bo;
>  	bo->vm_bo = base;
> +	amdgpu_vm_update_stats(base, bo->tbo.resource, NULL);
>
>  	if (!amdgpu_vm_is_bo_always_valid(vm, bo))
>  		return;
> @@ -1106,29 +1236,10 @@ static void amdgpu_vm_bo_get_memory(struct amdgpu_bo_va *bo_va,
>  }
>
>  void amdgpu_vm_get_memory(struct amdgpu_vm *vm,
> -			  struct amdgpu_mem_stats *stats,
> -			  unsigned int size)
> +			  struct amdgpu_mem_stats stats[__AMDGPU_PL_LAST])
>  {
> -	struct amdgpu_bo_va *bo_va, *tmp;
> -
>  	spin_lock(&vm->status_lock);
> -	list_for_each_entry_safe(bo_va, tmp, &vm->idle, base.vm_status)
> -		amdgpu_vm_bo_get_memory(bo_va, stats, size);
> -
> -	list_for_each_entry_safe(bo_va, tmp, &vm->evicted, base.vm_status)
> -		amdgpu_vm_bo_get_memory(bo_va, stats, size);
> -
> -	list_for_each_entry_safe(bo_va, tmp, &vm->relocated, base.vm_status)
> -		amdgpu_vm_bo_get_memory(bo_va, stats, size);
> -
> -	list_for_each_entry_safe(bo_va, tmp, &vm->moved, base.vm_status)
> -		amdgpu_vm_bo_get_memory(bo_va, stats, size);
> -
> -	list_for_each_entry_safe(bo_va, tmp, &vm->invalidated, base.vm_status)
> -		amdgpu_vm_bo_get_memory(bo_va, stats, size);
> -
> -	list_for_each_entry_safe(bo_va, tmp, &vm->done, base.vm_status)
> -		amdgpu_vm_bo_get_memory(bo_va, stats, size);
> +	memcpy(stats, vm->stats, sizeof(*stats) * __AMDGPU_PL_LAST);
>  	spin_unlock(&vm->status_lock);
>  }
>
> @@ -2071,6 +2182,7 @@ void amdgpu_vm_bo_del(struct amdgpu_device *adev,
>  		if (*base != &bo_va->base)
>  			continue;
>
> +		amdgpu_vm_update_stats(*base, NULL, bo->tbo.resource);
>  		*base = bo_va->base.next;
>  		break;
>  	}
> @@ -2136,6 +2248,22 @@ bool amdgpu_vm_evictable(struct amdgpu_bo *bo)
>  	return true;
>  }
>
> +/**
> + * amdgpu_vm_bo_update_shared - called when bo gets shared/unshared
> + *
> + * @bo: amdgpu buffer object
> + * @sign: if we should add (+1) or subtract (-1) the memory stat
> + *
> + * Update the per VM stats for all the vm
> + */
> +void amdgpu_vm_bo_update_shared(struct amdgpu_bo *bo, int sign)
> +{
> +	struct amdgpu_vm_bo_base *bo_base;
> +
> +	for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next)
> +		amdgpu_vm_update_shared(bo_base, sign);
> +}
> +
>  /**
>   * amdgpu_vm_bo_invalidate - mark the bo as invalid
>   *
> @@ -2169,6 +2297,26 @@ void amdgpu_vm_bo_invalidate(struct amdgpu_bo *bo, bool evicted)
>  	}
>  }
>
> +/**
> + * amdgpu_vm_bo_move - handle BO move
> + *
> + * @bo: amdgpu buffer object
> + * @new_mem: the new placement of the BO move
> + * @evicted: is the BO evicted
> + *
> + * Update the memory stats for the new placement and mark @bo as invalid.
> + */
> +void amdgpu_vm_bo_move(struct amdgpu_bo *bo, struct ttm_resource *new_mem,
> +		       bool evicted)
> +{
> +	struct amdgpu_vm_bo_base *bo_base;
> +
> +	for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next)
> +		amdgpu_vm_update_stats(bo_base, new_mem, bo->tbo.resource);
> +
> +	amdgpu_vm_bo_invalidate(bo, evicted);
> +}
> +
>  /**
>   * amdgpu_vm_get_block_size - calculate VM page table size as power of two
>   *
> @@ -2585,6 +2733,18 @@ void amdgpu_vm_release_compute(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>  	vm->is_compute_context = false;
>  }
>
> +static int amdgpu_vm_stats_is_zero(struct amdgpu_vm *vm)
> +{
> +	int is_zero = 1;
> +	for (int i = 0; i < __AMDGPU_PL_LAST, ++i) {
> +		if (!(is_zero = is_zero &&
> +		      drm_memory_stats_is_zero(&vm->stats[i].drm) &&
> +		      stats->evicted == 0 && stats->requested == 0))

The indentation here looks completely off.

> +			break;
> +	}
> +	return is_zero;
> +}
> +
>  /**
>   * amdgpu_vm_fini - tear down a vm instance
>   *
> @@ -2656,6 +2816,8 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>  		}
>  	}
>
> +	if (!amdgpu_vm_stats_is_zero(vm))
> +		dev_err(adev->dev, "VM memory stats is non-zero when fini\n");
>  }
>
>  /**
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> index 6a1b344e15e1b..7b3cd6367969d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> @@ -24,6 +24,7 @@
>  #ifndef __AMDGPU_VM_H__
>  #define __AMDGPU_VM_H__
>
> +#include "amdgpu_ttm.h"
>  #include <linux/idr.h>
>  #include <linux/kfifo.h>
>  #include <linux/rbtree.h>
> @@ -345,6 +346,9 @@ struct amdgpu_vm {
>  	/* Lock to protect vm_bo add/del/move on all lists of vm */
>  	spinlock_t status_lock;
>
> +	/* Memory statistics for this vm, protected by the status_lock */
> +	struct amdgpu_mem_stats stats[__AMDGPU_PL_LAST];
> +
>  	/* Per-VM and PT BOs who needs a validation */
>  	struct list_head evicted;
>
> @@ -525,6 +529,12 @@ int amdgpu_vm_bo_update(struct amdgpu_device *adev,
>  			bool clear);
>  bool amdgpu_vm_evictable(struct amdgpu_bo *bo);
>  void amdgpu_vm_bo_invalidate(struct amdgpu_bo *bo, bool evicted);
> +void amdgpu_vm_update_stats(struct amdgpu_vm_bo_base *base,
> +			    struct ttm_resource *new_mem,
> +			    struct ttm_resource *old_mem);
> +void amdgpu_vm_bo_update_shared(struct amdgpu_bo *bo, int sign);
> +void amdgpu_vm_bo_move(struct amdgpu_bo *bo, struct ttm_resource *new_mem,
> +		       bool evicted);
>  uint64_t amdgpu_vm_map_gart(const dma_addr_t *pages_addr, uint64_t addr);
>  struct amdgpu_bo_va *amdgpu_vm_bo_find(struct amdgpu_vm *vm,
>  				       struct amdgpu_bo *bo);
> @@ -575,8 +585,7 @@ void amdgpu_vm_set_task_info(struct amdgpu_vm *vm);
>  void amdgpu_vm_move_to_lru_tail(struct amdgpu_device *adev,
>  				struct amdgpu_vm *vm);
>  void amdgpu_vm_get_memory(struct amdgpu_vm *vm,
> -			  struct amdgpu_mem_stats *stats,
> -			  unsigned int size);
> +			  struct amdgpu_mem_stats stats[__AMDGPU_PL_LAST]);
>
>  int amdgpu_vm_pt_clear(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>  		       struct amdgpu_bo_vm *vmbo, bool immediate);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
> index f78a0434a48fa..bd57ced911e32 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
> @@ -537,6 +537,7 @@ static void amdgpu_vm_pt_free(struct amdgpu_vm_bo_base *entry)
>  	if (!entry->bo)
>  		return;
>
> +	amdgpu_vm_update_stats(entry, NULL, entry->bo->tbo.resource);
>  	entry->bo->vm_bo = NULL;
>  	ttm_bo_set_bulk_move(&entry->bo->tbo, NULL);
>
> diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
> index 714e42b051080..39e36fa1e89cd 100644
> --- a/drivers/gpu/drm/drm_file.c
> +++ b/drivers/gpu/drm/drm_file.c
> @@ -859,6 +859,14 @@ static void print_size(struct drm_printer *p, const char *stat,
>  	drm_printf(p, "drm-%s-%s:\t%llu%s\n", stat, region, sz, units[u]);
>  }
>
> +int drm_memory_stats_is_zero(const struct drm_memory_stats *stats) {
> +	return (stats->shared == 0 &&
> +		stats->private == 0 &&
> +		stats->resident == 0 &&
> +		stats->purgeable == 0 &&
> +		stats->active == 0);
> +}
> +

That needs a separate patch and review on the dri-devel mailing list.

Regards,
Christian.

>  /**
>   * drm_print_memory_stats - A helper to print memory stats
>   * @p: The printer to print output to
> diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
> index ab230d3af138d..7f91e35d027d9 100644
> --- a/include/drm/drm_file.h
> +++ b/include/drm/drm_file.h
> @@ -477,6 +477,7 @@ struct drm_memory_stats {
>
>  enum drm_gem_object_status;
>
> +int drm_memory_stats_is_zero(const struct drm_memory_stats *stats);
>  void drm_print_memory_stats(struct drm_printer *p,
>  			    const struct drm_memory_stats *stats,
>  			    enum drm_gem_object_status supported_status,
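For concreteness, the two review comments above (prefix the static helpers, and prefer a switch over the table lookup) would map to something like the following sketch. The names amdgpu_vm_fold_mem_type() and amdgpu_bo_domain_to_mem_type() are illustrative only, not from the submitted patch:

	/* Sketch, not the submitted patch: prefixed helper using a switch. */
	static uint32_t amdgpu_vm_fold_mem_type(uint32_t mem_type)
	{
		/* Squash private placements into 'cpu' to keep the legacy userspace view. */
		switch (mem_type) {
		case TTM_PL_VRAM:
		case TTM_PL_TT:
			return mem_type;
		default:
			return TTM_PL_SYSTEM;
		}
	}

	/* The domain-to-placement lookup as a switch, which also avoids the
	 * speculative execution concern with array indexing raised above.
	 */
	static uint32_t amdgpu_bo_domain_to_mem_type(uint32_t domain)
	{
		switch (domain) {
		case AMDGPU_GEM_DOMAIN_CPU:
			return TTM_PL_SYSTEM;
		case AMDGPU_GEM_DOMAIN_GTT:
			return TTM_PL_TT;
		case AMDGPU_GEM_DOMAIN_VRAM:
			return TTM_PL_VRAM;
		case AMDGPU_GEM_DOMAIN_GDS:
			return AMDGPU_PL_GDS;
		case AMDGPU_GEM_DOMAIN_GWS:
			return AMDGPU_PL_GWS;
		case AMDGPU_GEM_DOMAIN_OA:
			return AMDGPU_PL_OA;
		case AMDGPU_GEM_DOMAIN_DOORBELL:
			return AMDGPU_PL_DOORBELL;
		default:
			return TTM_PL_SYSTEM;
		}
	}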
[Public]

> > +static uint32_t fold_memtype(uint32_t memtype) {
> In general please add prefixes to even static functions, e.g. amdgpu_vm_ or amdgpu_bo_.
>
> > + /* Squash private placements into 'cpu' to keep the legacy userspace view. */
> > + switch (mem_type) {
> > + case TTM_PL_VRAM:
> > + case TTM_PL_TT:
> > + return memtype
> > + default:
> > + return TTM_PL_SYSTEM;
> > + }
> > +}
> > +
> > +static uint32_t bo_get_memtype(struct amdgpu_bo *bo) {
> That whole function belongs in amdgpu_bo.c

Do you mean bo_get_memtype or fold_memtype? I debated whether bo_get_memtype should go into amdgpu_vm.c or amdgpu_bo.c, and since it uses fold_memtype and is only useful for memory stats (because it folds the private placements) I just left them here together with the other mem stats code.

I can move it to amdgpu_bo.c, make it return the memtype verbatim, and just fold it when I do the accounting.

> > + struct ttm_resource *res = bo->tbo.resource;
> > + const uint32_t domain_to_pl[] = {
> > + [ilog2(AMDGPU_GEM_DOMAIN_CPU)] = TTM_PL_SYSTEM,
> > + [ilog2(AMDGPU_GEM_DOMAIN_GTT)] = TTM_PL_TT,
> > + [ilog2(AMDGPU_GEM_DOMAIN_VRAM)] = TTM_PL_VRAM,
> > + [ilog2(AMDGPU_GEM_DOMAIN_GDS)] = AMDGPU_PL_GDS,
> > + [ilog2(AMDGPU_GEM_DOMAIN_GWS)] = AMDGPU_PL_GWS,
> > + [ilog2(AMDGPU_GEM_DOMAIN_OA)] = AMDGPU_PL_OA,
> > + [ilog2(AMDGPU_GEM_DOMAIN_DOORBELL)] = AMDGPU_PL_DOORBELL,
> > + };
> > + uint32_t domain;
> > +
> > + if (res)
> > + return fold_memtype(res->mem_type);
> > +
> > + /*
> > + * If no backing store use one of the preferred domain for basic
> > + * stats. We take the MSB since that should give a reasonable
> > + * view.
> > + */
> > + BUILD_BUG_ON(TTM_PL_VRAM < TTM_PL_TT || TTM_PL_VRAM < TTM_PL_SYSTEM);
> > + domain = fls(bo->preferred_domains & AMDGPU_GEM_DOMAIN_MASK);
> > + if (drm_WARN_ON_ONCE(&adev->ddev,
> > + domain == 0 || --domain >= ARRAY_SIZE(domain_to_pl)))
> It's perfectly legal to create a BO without a placement. That one just won't have a backing store.

This is lifted from the previous change I'm rebasing onto. I think what it's trying to do is: if the BO doesn't have a placement, use the "biggest" (VRAM > TT > SYSTEM) preferred placement for the purpose of accounting. Previously we just ignored BOs that don't have a placement. I guess there's an argument for going with either approach.

> > + return 0;
> > + return fold_memtype(domain_to_pl[domain])
> That would need speculative execution mitigation if I'm not completely mistaken.
>
> Better use a switch/case statement.

Do you mean change the array indexing to a switch statement?

Regards,
Teddy
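What Teddy proposes in the last paragraph, a query helper living in amdgpu_bo.c that returns the raw placement with the folding done only at accounting time, could look roughly like this. A sketch only; the helper name is hypothetical, while preferred_domains and tbo.resource are the existing amdgpu_bo fields:

	/* In amdgpu_bo.c (hypothetical helper): report the raw placement. */
	uint32_t amdgpu_bo_get_mem_type(struct amdgpu_bo *bo)
	{
		/* Use the current placement if the BO has a backing store... */
		if (bo->tbo.resource)
			return bo->tbo.resource->mem_type;

		/* ...otherwise guess from the "biggest" preferred domain. */
		if (bo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM)
			return TTM_PL_VRAM;
		if (bo->preferred_domains & AMDGPU_GEM_DOMAIN_GTT)
			return TTM_PL_TT;
		return TTM_PL_SYSTEM;
	}

	/* In amdgpu_vm.c, fold private placements only when accounting: */
	type = fold_memtype(amdgpu_bo_get_mem_type(bo));
	vm->stats[type].drm.shared += size;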
Am 22.10.24 um 17:17 schrieb Li, Yunxiang (Teddy):
> [Public]
>
>>> +static uint32_t fold_memtype(uint32_t memtype) {
>> In general please add prefixes to even static functions, e.g. amdgpu_vm_ or amdgpu_bo_.
>>
>>> [SNIP]
>>> +static uint32_t bo_get_memtype(struct amdgpu_bo *bo) {
>> That whole function belongs in amdgpu_bo.c
> Do you mean bo_get_memtype or fold_memtype? I debated whether bo_get_memtype should go into amdgpu_vm.c or amdgpu_bo.c, and since it uses fold_memtype and is only useful for memory stats (because it folds the private placements) I just left them here together with the other mem stats code.
>
> I can move it to amdgpu_bo.c, make it return the memtype verbatim, and just fold it when I do the accounting.

I think that folding GDS, GWS and OA into system is also a bug. We should really not be doing that.

Just wanted to point out for this round that the code to query the current placement from a BO should probably go into amdgpu_bo.c and not amdgpu_vm.c

>>> [SNIP]
>>> + BUILD_BUG_ON(TTM_PL_VRAM < TTM_PL_TT || TTM_PL_VRAM < TTM_PL_SYSTEM);
>>> + domain = fls(bo->preferred_domains & AMDGPU_GEM_DOMAIN_MASK);
>>> + if (drm_WARN_ON_ONCE(&adev->ddev,
>>> + domain == 0 || --domain >= ARRAY_SIZE(domain_to_pl)))
>> It's perfectly legal to create a BO without a placement. That one just won't have a backing store.
>>
> This is lifted from the previous change I'm rebasing onto. I think what it's trying to do is: if the BO doesn't have a placement, use the "biggest" (VRAM > TT > SYSTEM) preferred placement for the purpose of accounting. Previously we just ignored BOs that don't have a placement. I guess there's an argument for going with either approach.

I was not arguing, I'm simply pointing out a bug. It's perfectly valid for bo->preferred_domains to be 0.

So the following WARN_ON() that no bit is set is incorrect.

>
>>> + return 0;
>>> + return fold_memtype(domain_to_pl[domain])
>> That would need speculative execution mitigation if I'm not completely mistaken.
>>
>> Better use a switch/case statement.
>>
> Do you mean change the array indexing to a switch statement?

Yes.

Regards,
Christian.

>
> Regards,
> Teddy
[Public]

I suppose we could add a field like amd-memory-private: to cover the private placements.

When would a BO not have a placement? Is it when it is being moved? Since we are tracking the state changes, I wonder if such situations can be avoided now, so that whenever we call these stat update functions the BO would always have a placement.

Teddy
Am 22.10.24 um 18:46 schrieb Li, Yunxiang (Teddy):
> [Public]
>
> I suppose we could add a field like amd-memory-private: to cover the private placements.

No, that is not really appropriate either. GWS, GDS and OA are not memory in the first place.

Those BOs are HW blocks which the driver allocated to use.

So accounting them as memory usage doesn't make any sense at all.

We could print them in the fdinfo as something special for statistics, but it's probably not that useful.

> When would a BO not have a placement? Is it when it is being moved?

There are BOs which are only temporary, so when they are evicted their backing store is just discarded.

In addition, allocation of the backing store is sometimes delayed until the first use.

> Since we are tracking the state changes, I wonder if such situations can be avoided now, so that whenever we call these stat update functions the BO would always have a placement.

No, as I said before those use cases are perfectly valid. BOs don't need a backing store, nor do they need a placement.

So the code has to gracefully handle that.

Regards,
Christian.

>
> Teddy
[AMD Official Use Only - AMD Internal Distribution Only]

It sounds like it makes the most sense to ignore the BOs that have no placement or private placements then; it would simplify the code too.

Teddy
Am 22.10.24 um 19:09 schrieb Li, Yunxiang (Teddy):
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> It sounds like it makes the most sense to ignore the BOs that have no placement or private placements then; it would simplify the code too.

Yeah, that works for me.

Regards,
Christian.

>
> Teddy
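The simplification agreed on here, skipping BOs with no backing store as well as the private placements, could be captured by a helper along these lines. This is a sketch under that assumption; the helper name is hypothetical:

	/* Sketch: report a placement only when the BO should be accounted. */
	static bool amdgpu_bo_mem_stats_placement(struct amdgpu_bo *bo,
						  uint32_t *mem_type)
	{
		struct ttm_resource *res = bo->tbo.resource;

		if (!res)
			return false;	/* no backing store, nothing to account */

		switch (res->mem_type) {
		case TTM_PL_VRAM:
		case TTM_PL_TT:
		case TTM_PL_SYSTEM:
			*mem_type = res->mem_type;
			return true;
		default:
			return false;	/* GDS/GWS/OA etc. are not memory */
		}
	}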
On 22/10/2024 18:06, Christian König wrote:
> Am 22.10.24 um 18:46 schrieb Li, Yunxiang (Teddy):
>> [Public]
>>
>> I suppose we could add a field like amd-memory-private: to cover the private placements.
>
> No, that is not really appropriate either. GWS, GDS and OA are not memory in the first place.
>
> Those BOs are HW blocks which the driver allocated to use.
>
> So accounting them as memory usage doesn't make any sense at all.
>
> We could print them in the fdinfo as something special for statistics, but it's probably not that useful.
>
>> When would a BO not have a placement? Is it when it is being moved?
>
> There are BOs which are only temporary, so when they are evicted their backing store is just discarded.
>
> In addition, allocation of the backing store is sometimes delayed until the first use.

Would this work correctly if the allowed mask was used instead of the preferred one?

Point being, to correctly support the fdinfo drm-total- stats, *if* a BO *can* have a backing store at any point it should always be counted there. *If* it currently has a placement it is drm-resident-. If it has a placement but can be discarded it is drm-purgeable-. Etc.

Regards,

Tvrtko

>
>> Since we are tracking the state changes, I wonder if such situations can be avoided now, so that whenever we call these stat update functions the BO would always have a placement.
>
> No, as I said before those use cases are perfectly valid. BOs don't need a backing store, nor do they need a placement.
>
> So the code has to gracefully handle that.
>
> Regards,
> Christian.
>
>>
>> Teddy
>
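Concretely, Tvrtko's suggestion would amount to deriving the drm-total- bucket from bo->allowed_domains rather than bo->preferred_domains, along these lines (a sketch; the helper name is hypothetical, allowed_domains is the existing amdgpu_bo field):

	/* Sketch: a BO counts towards drm-total- under the "biggest"
	 * placement it can ever occupy, whether or not it is resident.
	 */
	static uint32_t amdgpu_bo_total_mem_type(struct amdgpu_bo *bo)
	{
		uint32_t domains = bo->allowed_domains & AMDGPU_GEM_DOMAIN_MASK;

		if (domains & AMDGPU_GEM_DOMAIN_VRAM)
			return TTM_PL_VRAM;
		if (domains & AMDGPU_GEM_DOMAIN_GTT)
			return TTM_PL_TT;
		return TTM_PL_SYSTEM;
	}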
On 22/10/2024 17:24, Christian König wrote:
> Am 22.10.24 um 17:17 schrieb Li, Yunxiang (Teddy):
>> [Public]
>>
>>>> +static uint32_t fold_memtype(uint32_t memtype) {
>>> In general please add prefixes to even static functions, e.g. amdgpu_vm_ or amdgpu_bo_.
>>>
>>>> [SNIP]
>>>> + return 0;
>>>> + return fold_memtype(domain_to_pl[domain])
>>> That would need speculative execution mitigation if I'm not completely mistaken.
>>>
>>> Better use a switch/case statement.
>>>
>> Do you mean change the array indexing to a switch statement?
>
> Yes.

Did you mean array_index_nospec?

Domain is not a direct userspace input and is calculated from the mask, which is sanitized to allowed values prior to this call. So I *think* switch is overkill but don't mind it either. Just commenting FWIW.

Regards,

Tvrtko
Am 23.10.24 um 09:38 schrieb Tvrtko Ursulin:
>
> On 22/10/2024 17:24, Christian König wrote:
>> Am 22.10.24 um 17:17 schrieb Li, Yunxiang (Teddy):
>>> [SNIP]
>>> Do you mean change the array indexing to a switch statement?
>>
>> Yes.
>
> Did you mean array_index_nospec?

Yes.

> Domain is not a direct userspace input and is calculated from the mask, which is sanitized to allowed values prior to this call. So I *think* switch is overkill but don't mind it either. Just commenting FWIW.

I missed that the mask is applied.

Thinking more about it I'm not sure if we should do this conversion in the first place. IIRC Tvrtko you once suggested a patch which switched a bunch of code to use the TTM placement instead of the UAPI flags.

Going further in this direction I think when we want to look at the current placement we should probably also use the TTM PL enumeration directly.

Regards,
Christian.

>
> Regards,
>
> Tvrtko
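For reference, the mitigation being discussed is array_index_nospec() from linux/nospec.h. If the table lookup were kept, hardening the two quoted lines would look roughly like this (a sketch reusing the patch's domain and domain_to_pl from the surrounding function):

	#include <linux/nospec.h>

	/* Clamp the index under speculation before using it, as an
	 * alternative to rewriting the lookup as a switch statement.
	 */
	domain = array_index_nospec(domain, ARRAY_SIZE(domain_to_pl));
	return fold_memtype(domain_to_pl[domain]);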
On 23/10/2024 10:14, Christian König wrote:
> Am 23.10.24 um 09:38 schrieb Tvrtko Ursulin:
>> [SNIP]
>> Did you mean array_index_nospec?
>
> Yes.
>
>> Domain is not a direct userspace input and is calculated from the mask, which is sanitized to allowed values prior to this call. So I *think* switch is overkill but don't mind it either. Just commenting FWIW.
>
> I missed that the mask is applied.
>
> Thinking more about it I'm not sure if we should do this conversion in the first place. IIRC Tvrtko you once suggested a patch which switched a bunch of code to use the TTM placement instead of the UAPI flags.

Maybe 8fb0efb10184 ("drm/amdgpu: Reduce mem_type to domain double indirection") is what you are thinking of?

> Going further in this direction I think when we want to look at the current placement we should probably also use the TTM PL enumeration directly.

It does this already. The placement flags are just used to "invent" a TTM PL enum when bo->tbo.resource == NULL.

	if (!res) {
		/*
		 * If no backing store use one of the preferred domain for basic
		 * stats. We take the MSB since that should give a reasonable
		 * view.
		 */
		BUILD_BUG_ON(TTM_PL_VRAM < TTM_PL_TT ||
			     TTM_PL_VRAM < TTM_PL_SYSTEM);
		type = fls(bo->preferred_domains & AMDGPU_GEM_DOMAIN_MASK);
		if (!type)
			return;
		type--;
		if (drm_WARN_ON_ONCE(&adev->ddev,
				     type >= ARRAY_SIZE(domain_to_pl)))
			return;
		type = domain_to_pl[type];
	} else {
		type = res->mem_type;
	}
	...
	stats[type].total += size;
	if (drm_gem_object_is_shared_for_memory_stats(obj))
		stats[type].drm.shared += size;
	else
		stats[type].drm.private += size;
	... etc ...

So all actual stat accounting is based on the TTM PL type enum. And then later at fdinfo print time:

	for (i = 0; i < TTM_PL_PRIV; i++)
		drm_print_memory_stats(p, &stats[i].drm,
				       DRM_GEM_OBJECT_RESIDENT |
				       DRM_GEM_OBJECT_PURGEABLE,
				       pl_name[i]);

Again, based on the same enum. Not sure if you have something else in mind or are happy with that?

Then what Teddy does is IMO only tangential; he just changes when stats are collected, not this aspect.

Whether or not to fold the special placements (GWS, GDS & co) is also tangential. In my patch I just preserved the legacy behaviour so it can easily be tweaked on top.

Regards,

Tvrtko

>
> Regards,
> Christian.
Am 23.10.24 um 13:37 schrieb Tvrtko Ursulin:
>
> On 23/10/2024 10:14, Christian König wrote:
>> [SNIP]
>> Thinking more about it I'm not sure if we should do this conversion in the first place. IIRC Tvrtko you once suggested a patch which switched a bunch of code to use the TTM placement instead of the UAPI flags.
>
> Maybe 8fb0efb10184 ("drm/amdgpu: Reduce mem_type to domain double indirection") is what you are thinking of?

Yes, exactly that one.

>
>> Going further in this direction I think when we want to look at the current placement we should probably also use the TTM PL enumeration directly.
>
> It does this already. The placement flags are just used to "invent" a TTM PL enum when bo->tbo.resource == NULL.

Ah, good point! I thought we would do the mapping the other way around.

In this case that is even more something we should probably not do at all.

When bo->tbo.resource is NULL then this BO isn't resident at all, so it should not count towards resident memory.

> Again, based on the same enum. Not sure if you have something else in mind or are happy with that?

I think that for drm-total-* we should use the GEM flags and for drm-resident-* we should use the TTM placement.

> Then what Teddy does is IMO only tangential; he just changes when stats are collected, not this aspect.

Yeah, right, but we should probably fix it up in the right way while we're at it.

> Whether or not to fold the special placements (GWS, GDS & co) is also tangential. In my patch I just preserved the legacy behaviour so it can easily be tweaked on top.

Yeah, but again the original behavior is completely broken.

GWS, GDS and OA are counted in blocks of HW units (multiplied by PAGE_SIZE IIRC to avoid some GEM&TTM warnings).

When you accumulate that anywhere in the memory stats then that is just completely off.

Regards,
Christian.

>
> Regards,
>
> Tvrtko
On 23/10/2024 13:12, Christian König wrote:
> Am 23.10.24 um 13:37 schrieb Tvrtko Ursulin:
>> [SNIP]
>> It does this already. The placement flags are just used to "invent" a TTM PL enum when bo->tbo.resource == NULL.
>
> Ah, good point! I thought we would do the mapping the other way around.
>
> In this case that is even more something we should probably not do at all.
>
> When bo->tbo.resource is NULL then this BO isn't resident at all, so it should not count towards resident memory.

It doesn't, only for total. I should have pasted more context:

	struct ttm_resource *res = bo->tbo.resource;
	...
	/* DRM stats common fields: */

	stats[type].total += size;
	if (drm_gem_object_is_shared_for_memory_stats(obj))
		stats[type].drm.shared += size;
	else
		stats[type].drm.private += size;

	if (res) {
		stats[type].drm.resident += size

So if there is no current placement it does not count towards drm-resident-, only drm-total- (which is drm.private + drm.resident). Total and resident are intended to be analogous to, for instance, VIRT and RES in top(1), or VSZ and RSS in ps(1).

>> Again, based on the same enum. Not sure if you have something else in mind or are happy with that?
>
> I think that for drm-total-* we should use the GEM flags and for drm-resident-* we should use the TTM placement.

Agreed! :)

>> Then what Teddy does is IMO only tangential; he just changes when stats are collected, not this aspect.
>
> Yeah, right, but we should probably fix it up in the right way while we're at it.

Okay, we just need to align on whether there is a problem and how to fix it.

>> Whether or not to fold the special placements (GWS, GDS & co) is also tangential. In my patch I just preserved the legacy behaviour so it can easily be tweaked on top.
>
> Yeah, but again the original behavior is completely broken.
>
> GWS, GDS and OA are counted in blocks of HW units (multiplied by PAGE_SIZE IIRC to avoid some GEM&TTM warnings).
>
> When you accumulate that anywhere in the memory stats then that is just completely off.

Ooops. :) Are they backed by some memory though, be it system or VRAM?

Regards,

Tvrtko
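Put together, the direction agreed above (drm-total-* keyed off the GEM side, drm-resident-* off the actual TTM placement) might look like the sketch below. The helper names are assumed, amdgpu_bo_total_mem_type() refers to the sketch earlier in the thread, and shared/purgeable handling is elided:

	/* Sketch of the agreed split, not the submitted patch. */
	static void amdgpu_bo_accumulate_mem_stats(struct amdgpu_bo *bo,
						   struct amdgpu_mem_stats *stats)
	{
		struct ttm_resource *res = bo->tbo.resource;
		uint64_t size = amdgpu_bo_size(bo);

		if (res) {
			/* drm-resident-*: keyed by the actual TTM placement. */
			stats[res->mem_type].drm.resident += size;
		} else {
			/* Not resident: contributes to drm-total- only, keyed
			 * by where the BO could ever be placed (GEM domains).
			 */
			stats[amdgpu_bo_total_mem_type(bo)].drm.private += size;
		}
	}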
Am 23.10.24 um 14:24 schrieb Tvrtko Ursulin:
> [SNIP]
>>> Whether or not to fold the special placements (GWS, GDS & co) is also tangential. In my patch I just preserved the legacy behaviour so it can easily be tweaked on top.
>>
>> Yeah, but again the original behavior is completely broken.
>>
>> GWS, GDS and OA are counted in blocks of HW units (multiplied by PAGE_SIZE IIRC to avoid some GEM&TTM warnings).
>>
>> When you accumulate that anywhere in the memory stats then that is just completely off.
>
> Ooops. :) Are they backed by some memory though, be it system or VRAM?

GDS is an internal 4 or 64KiB memory block which is only valid while shaders are running. It is used to communicate stuff between different shader stages and is not even CPU accessible.

GWS and OA are not even memory, those are just HW blocks which implement a fixed function. IIRC most HW generations have 16 of each, and when setting up the application virtual address space you can specify how many will be used by the application.

Regards,
Christian.

>
> Regards,
>
> Tvrtko
[AMD Official Use Only - AMD Internal Distribution Only]

> From: Tvrtko Ursulin <tvrtko.ursulin@igalia.com>
> Sent: Wednesday, October 23, 2024 8:25
>
> On 23/10/2024 13:12, Christian König wrote:
> [SNIP]
> > When bo->tbo.resource is NULL then this BO isn't resident at all, so
> > it should not account to resident memory.
>
> It doesn't, only for total. I should have pasted more context..:
> [SNIP]
> So if no current placement it does not count towards drm-resident-,
> only drm-total- (which is drm.private + drm.resident). Total and
> resident are intended to be analogous to, for instance, VIRT and RES
> in top(1), or VSZ and RSS in ps(1).
>
> > I think that for drm-total-* we should use the GEM flags and for
> > drm-resident-* we should use the TTM placement.
>
> Agreed! :)

Oof, I missed the distinction between resident and total as well. Just
want to double confirm the drm-total- semantics.

Does drm-total- track the BOs that prefer the placement (derived from the
preferred domain) and drm-resident- track the actual placement, or does
drm-total- track drm-resident- plus BOs that don't have a placement but
prefer it there?

[SNIP]
[AMD Official Use Only - AMD Internal Distribution Only]

Yeah, it looks like I missed the whole active/purgeable thing as well...

Teddy
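For reference, the active/purgeable classification in the helper being
removed (see amdgpu_bo_get_memory() in the diff below) boils down to:

	if (!dma_resv_test_signaled(obj->resv, DMA_RESV_USAGE_BOOKKEEP))
		stats[type].drm.active += size;	   /* still busy on the GPU */
	else if (bo->flags & AMDGPU_GEM_CREATE_DISCARDABLE)
		stats[type].drm.purgeable += size; /* idle and discardable */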
On 23/10/2024 14:31, Li, Yunxiang (Teddy) wrote:
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> [SNIP]
>
> Oof, I missed the distinction between resident and total as well. Just
> want to double confirm the drm-total- semantics.
>
> Does drm-total- track the BOs that prefer the placement (derived from
> the preferred domain) and drm-resident- track the actual placement, or
> does drm-total- track drm-resident- plus BOs that don't have a
> placement but prefer it there?

Preferred domain is only used as an aid when there is no backing store.

Putting that aside, it is supposed to be simple:

Total - BO exists, whether or not it currently has a backing store.

Resident - BO has a backing store. Active and purgeable are sub-variants
of resident.

Again, preferred placement only comes into the picture (in the current
implementation) when there is no backing store, because we must account
the total somewhere and looking at the preferred domains tells us where.

Yeah, it may be that one second you have:

total-vram: 4 KiB
resident-vram: 0
total-system: 0
resident-system: 0

And the next:

total-vram: 0
resident-vram: 0
total-system: 4 KiB
resident-system: 4 KiB

All with a single 4 KiB BO which happened to instantiate its backing
store in its 2nd preferred placement. But IMO that is better than not
accounting it under total anywhere in the first case.

Have I managed to explain it well enough?

Regards,

Tvrtko

[SNIP]
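In code terms, the rule described above amounts to something like this
illustrative sketch (not driver code; placement_from_preferred() is a
hypothetical stand-in for the MSB-of-preferred-domains fallback):

	uint64_t size = amdgpu_bo_size(bo);
	unsigned int type = bo->tbo.resource ?
		bo->tbo.resource->mem_type :   /* where it actually lives */
		placement_from_preferred(bo);  /* hypothetical fallback */

	/* Every existing BO counts towards drm-total- in some region... */
	stats[type].drm.private += size;
	/* ...but only BOs with a backing store count towards drm-resident-. */
	if (bo->tbo.resource)
		stats[type].drm.resident += size;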
On 23/10/2024 13:56, Christian König wrote:
> Am 23.10.24 um 14:24 schrieb Tvrtko Ursulin:
>> [SNIP]
>> Ooops. :) Are they backed by some memory though, be it system or VRAM?
>
> GDS is an internal 4 or 64 KiB memory block which is only valid while
> shaders are running. It is used to communicate stuff between different
> shader stages and is not even CPU accessible.
>
> GWS and OA are not even memory, those are just HW blocks which
> implement a fixed function. IIRC most HW generations have 16 of each,
> and when setting up the application virtual address space you can
> specify how many will be used by the application.

I see, thank you! Though I could have bothered to look in the code or
even instrument at runtime too.

I agree removing it from system is correct. If wanted and/or desirable,
some or all could even be exported as different memory regions. The DRM
fdinfo spec already allows that. Like:

drm-total-vram: ...
drm-total-gds: ...
drm-total-oa: ...

Etc.

Regards,

Tvrtko
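A sketch of what emitting such extra regions could look like in
amdgpu_show_fdinfo(), assuming the stats were tracked per TTM placement
without folding (illustrative only; "flags" stands in for the supported
status mask passed to the helper):

	const char *pl_name[] = {
		[TTM_PL_VRAM]	= "vram",
		[TTM_PL_TT]	= "gtt",
		[TTM_PL_SYSTEM]	= "cpu",
		[AMDGPU_PL_GDS]	= "gds",
		[AMDGPU_PL_GWS]	= "gws",
		[AMDGPU_PL_OA]	= "oa",
	};

	for (i = 0; i < ARRAY_SIZE(pl_name); i++) {
		if (!pl_name[i])
			continue;
		/* One drm-total-/drm-resident- line pair per region. */
		drm_print_memory_stats(p, &stats[i].drm, flags, pl_name[i]);
	}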
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
index b144404902255..1d8a0ff3c8604 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
@@ -36,6 +36,7 @@
 #include "amdgpu_gem.h"
 #include "amdgpu_dma_buf.h"
 #include "amdgpu_xgmi.h"
+#include "amdgpu_vm.h"
 #include <drm/amdgpu_drm.h>
 #include <drm/ttm/ttm_tt.h>
 #include <linux/dma-buf.h>
@@ -190,6 +191,13 @@ static void amdgpu_dma_buf_unmap(struct dma_buf_attachment *attach,
 	}
 }
 
+static void amdgpu_dma_buf_release(struct dma_buf *buf)
+{
+	struct amdgpu_bo *bo = gem_to_amdgpu_bo(buf->priv);
+
+	amdgpu_vm_bo_update_shared(bo, -1);
+	drm_gem_dmabuf_release(buf);
+}
+
 /**
  * amdgpu_dma_buf_begin_cpu_access - &dma_buf_ops.begin_cpu_access implementation
  * @dma_buf: Shared DMA buffer
@@ -237,7 +245,7 @@ const struct dma_buf_ops amdgpu_dmabuf_ops = {
 	.unpin = amdgpu_dma_buf_unpin,
 	.map_dma_buf = amdgpu_dma_buf_map,
 	.unmap_dma_buf = amdgpu_dma_buf_unmap,
-	.release = drm_gem_dmabuf_release,
+	.release = amdgpu_dma_buf_release,
 	.begin_cpu_access = amdgpu_dma_buf_begin_cpu_access,
 	.mmap = drm_gem_dmabuf_mmap,
 	.vmap = drm_gem_dmabuf_vmap,
@@ -265,8 +273,10 @@ struct dma_buf *amdgpu_gem_prime_export(struct drm_gem_object *gobj,
 		return ERR_PTR(-EPERM);
 
 	buf = drm_gem_prime_export(gobj, flags);
-	if (!IS_ERR(buf))
+	if (!IS_ERR(buf)) {
 		buf->ops = &amdgpu_dmabuf_ops;
+		amdgpu_vm_bo_update_shared(bo, +1);
+	}
 
 	return buf;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
index 7a9573958d87c..ceedfc3665c18 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fdinfo.c
@@ -40,6 +40,7 @@
 #include "amdgpu_gem.h"
 #include "amdgpu_ctx.h"
 #include "amdgpu_fdinfo.h"
+#include "amdgpu_ttm.h"
 
 
 static const char *amdgpu_ip_name[AMDGPU_HW_IP_NUM] = {
@@ -60,7 +61,7 @@ void amdgpu_show_fdinfo(struct drm_printer *p, struct drm_file *file)
 	struct amdgpu_fpriv *fpriv = file->driver_priv;
 	struct amdgpu_vm *vm = &fpriv->vm;
 
-	struct amdgpu_mem_stats stats[__AMDGPU_PL_LAST + 1] = { };
+	struct amdgpu_mem_stats stats[__AMDGPU_PL_LAST] = { };
 	ktime_t usage[AMDGPU_HW_IP_NUM];
 	const char *pl_name[] = {
 		[TTM_PL_VRAM] = "vram",
@@ -70,13 +71,7 @@ void amdgpu_show_fdinfo(struct drm_printer *p, struct drm_file *file)
 	unsigned int hw_ip, i;
 	int ret;
 
-	ret = amdgpu_bo_reserve(vm->root.bo, false);
-	if (ret)
-		return;
-
-	amdgpu_vm_get_memory(vm, stats, ARRAY_SIZE(stats));
-	amdgpu_bo_unreserve(vm->root.bo);
-
+	amdgpu_vm_get_memory(vm, stats);
 	amdgpu_ctx_mgr_usage(&fpriv->ctx_mgr, usage);
 
 	/*
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 2436b7c9ad12b..5ff147881da6d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1156,7 +1156,7 @@ void amdgpu_bo_move_notify(struct ttm_buffer_object *bo,
 		return;
 
 	abo = ttm_to_amdgpu_bo(bo);
-	amdgpu_vm_bo_invalidate(abo, evict);
+	amdgpu_vm_bo_move(abo, new_mem, evict);
 
 	amdgpu_bo_kunmap(abo);
 
@@ -1169,86 +1169,6 @@ void amdgpu_bo_move_notify(struct ttm_buffer_object *bo,
 			 old_mem ? old_mem->mem_type : -1);
 }
 
-void amdgpu_bo_get_memory(struct amdgpu_bo *bo,
-			  struct amdgpu_mem_stats *stats,
-			  unsigned int sz)
-{
-	const unsigned int domain_to_pl[] = {
-		[ilog2(AMDGPU_GEM_DOMAIN_CPU)]	    = TTM_PL_SYSTEM,
-		[ilog2(AMDGPU_GEM_DOMAIN_GTT)]	    = TTM_PL_TT,
-		[ilog2(AMDGPU_GEM_DOMAIN_VRAM)]	    = TTM_PL_VRAM,
-		[ilog2(AMDGPU_GEM_DOMAIN_GDS)]	    = AMDGPU_PL_GDS,
-		[ilog2(AMDGPU_GEM_DOMAIN_GWS)]	    = AMDGPU_PL_GWS,
-		[ilog2(AMDGPU_GEM_DOMAIN_OA)]	    = AMDGPU_PL_OA,
-		[ilog2(AMDGPU_GEM_DOMAIN_DOORBELL)] = AMDGPU_PL_DOORBELL,
-	};
-	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
-	struct ttm_resource *res = bo->tbo.resource;
-	struct drm_gem_object *obj = &bo->tbo.base;
-	uint64_t size = amdgpu_bo_size(bo);
-	unsigned int type;
-
-	if (!res) {
-		/*
-		 * If no backing store use one of the preferred domain for basic
-		 * stats. We take the MSB since that should give a reasonable
-		 * view.
-		 */
-		BUILD_BUG_ON(TTM_PL_VRAM < TTM_PL_TT ||
-			     TTM_PL_VRAM < TTM_PL_SYSTEM);
-		type = fls(bo->preferred_domains & AMDGPU_GEM_DOMAIN_MASK);
-		if (!type)
-			return;
-		type--;
-		if (drm_WARN_ON_ONCE(&adev->ddev,
-				     type >= ARRAY_SIZE(domain_to_pl)))
-			return;
-		type = domain_to_pl[type];
-	} else {
-		type = res->mem_type;
-	}
-
-	/* Squash some into 'cpu' to keep the legacy userspace view. */
-	switch (type) {
-	case TTM_PL_VRAM:
-	case TTM_PL_TT:
-	case TTM_PL_SYSTEM:
-		break;
-	default:
-		type = TTM_PL_SYSTEM;
-		break;
-	}
-
-	if (drm_WARN_ON_ONCE(&adev->ddev, type >= sz))
-		return;
-
-	/* DRM stats common fields: */
-
-	if (drm_gem_object_is_shared_for_memory_stats(obj))
-		stats[type].drm.shared += size;
-	else
-		stats[type].drm.private += size;
-
-	if (res) {
-		stats[type].drm.resident += size;
-
-		if (!dma_resv_test_signaled(obj->resv, DMA_RESV_USAGE_BOOKKEEP))
-			stats[type].drm.active += size;
-		else if (bo->flags & AMDGPU_GEM_CREATE_DISCARDABLE)
-			stats[type].drm.purgeable += size;
-	}
-
-	/* amdgpu specific stats: */
-
-	if (bo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM) {
-		stats[TTM_PL_VRAM].requested += size;
-		if (type != TTM_PL_VRAM)
-			stats[TTM_PL_VRAM].evicted += size;
-	} else if (bo->preferred_domains & AMDGPU_GEM_DOMAIN_GTT) {
-		stats[TTM_PL_TT].requested += size;
-	}
-}
-
 /**
  * amdgpu_bo_release_notify - notification about a BO being released
  * @bo: pointer to a buffer object
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index be6769852ece4..ebad4f96775d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -300,9 +300,6 @@ int amdgpu_bo_sync_wait_resv(struct amdgpu_device *adev, struct dma_resv *resv,
 int amdgpu_bo_sync_wait(struct amdgpu_bo *bo, void *owner, bool intr);
 u64 amdgpu_bo_gpu_offset(struct amdgpu_bo *bo);
 u64 amdgpu_bo_gpu_offset_no_check(struct amdgpu_bo *bo);
-void amdgpu_bo_get_memory(struct amdgpu_bo *bo,
-			  struct amdgpu_mem_stats *stats,
-			  unsigned int size);
 uint32_t amdgpu_bo_get_preferred_domain(struct amdgpu_device *adev,
 					uint32_t domain);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 9fab64edd0530..a802cea67a4d7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -36,6 +36,7 @@
 #include <drm/ttm/ttm_tt.h>
 #include <drm/drm_exec.h>
 #include "amdgpu.h"
+#include "amdgpu_vm.h"
 #include "amdgpu_trace.h"
 #include "amdgpu_amdkfd.h"
 #include "amdgpu_gmc.h"
@@ -310,6 +311,134 @@ static void amdgpu_vm_bo_reset_state_machine(struct amdgpu_vm *vm)
 	spin_unlock(&vm->status_lock);
 }
 
+static uint32_t fold_memtype(uint32_t memtype)
+{
+	/* Squash private placements into 'cpu' to keep the legacy userspace view. */
+	switch (memtype) {
+	case TTM_PL_VRAM:
+	case TTM_PL_TT:
+		return memtype;
+	default:
+		return TTM_PL_SYSTEM;
+	}
+}
+
+static uint32_t bo_get_memtype(struct amdgpu_bo *bo)
+{
+	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
+	struct ttm_resource *res = bo->tbo.resource;
+	const uint32_t domain_to_pl[] = {
+		[ilog2(AMDGPU_GEM_DOMAIN_CPU)]	    = TTM_PL_SYSTEM,
+		[ilog2(AMDGPU_GEM_DOMAIN_GTT)]	    = TTM_PL_TT,
+		[ilog2(AMDGPU_GEM_DOMAIN_VRAM)]	    = TTM_PL_VRAM,
+		[ilog2(AMDGPU_GEM_DOMAIN_GDS)]	    = AMDGPU_PL_GDS,
+		[ilog2(AMDGPU_GEM_DOMAIN_GWS)]	    = AMDGPU_PL_GWS,
+		[ilog2(AMDGPU_GEM_DOMAIN_OA)]	    = AMDGPU_PL_OA,
+		[ilog2(AMDGPU_GEM_DOMAIN_DOORBELL)] = AMDGPU_PL_DOORBELL,
+	};
+	uint32_t domain;
+
+	if (res)
+		return fold_memtype(res->mem_type);
+
+	/*
+	 * If no backing store use one of the preferred domain for basic
+	 * stats. We take the MSB since that should give a reasonable
+	 * view.
+	 */
+	BUILD_BUG_ON(TTM_PL_VRAM < TTM_PL_TT || TTM_PL_VRAM < TTM_PL_SYSTEM);
+	domain = fls(bo->preferred_domains & AMDGPU_GEM_DOMAIN_MASK);
+	if (drm_WARN_ON_ONCE(&adev->ddev,
+			     domain == 0 || --domain >= ARRAY_SIZE(domain_to_pl)))
+		return 0;
+	return fold_memtype(domain_to_pl[domain]);
+}
+
+/**
+ * amdgpu_vm_update_shared - helper to update shared memory stat
+ * @base: base structure for tracking BO usage in a VM
+ * @sign: if we should add (+1) or subtract (-1) the memory stat
+ *
+ * Takes the vm status_lock and updates the shared memory stat. If the basic
+ * stat changed (e.g. buffer was moved) amdgpu_vm_update_stats needs to be
+ * called as well.
+ */
+static void amdgpu_vm_update_shared(struct amdgpu_vm_bo_base *base, int sign)
+{
+	struct amdgpu_vm *vm = base->vm;
+	struct amdgpu_bo *bo = base->bo;
+	int64_t size;
+	int type;
+
+	if (!vm || !bo || !(sign == +1 || sign == -1))
+		return;
+
+	spin_lock(&vm->status_lock);
+	size = sign * amdgpu_bo_size(bo);
+	type = bo_get_memtype(bo);
+	vm->stats[type].drm.shared += size;
+	vm->stats[type].drm.private -= size;
+	spin_unlock(&vm->status_lock);
+}
+
+/**
+ * amdgpu_vm_update_stats - helper to update normal memory stat
+ * @base: base structure for tracking BO usage in a VM
+ * @new_mem: the new placement of the BO if any (e.g. NULL when BO is deleted)
+ * @old_mem: the old placement of the BO if any (e.g. NULL when BO is created)
+ *
+ * Takes the vm status_lock and updates the basic memory stat. If the shared
+ * stat changed (e.g. buffer was exported) amdgpu_vm_update_shared needs to be
+ * called as well.
+ */
+void amdgpu_vm_update_stats(struct amdgpu_vm_bo_base *base,
+			    struct ttm_resource *new_mem,
+			    struct ttm_resource *old_mem)
+{
+	struct amdgpu_vm *vm = base->vm;
+	struct amdgpu_bo *bo = base->bo;
+	uint64_t size;
+	int type;
+	bool shared;
+
+	if (!vm || !bo || (!new_mem && !old_mem))
+		return;
+
+	spin_lock(&vm->status_lock);
+
+	size = amdgpu_bo_size(bo);
+	shared = drm_gem_object_is_shared_for_memory_stats(&bo->tbo.base);
+
+	if (old_mem) {
+		type = fold_memtype(old_mem->mem_type);
+		if (shared)
+			vm->stats[type].drm.shared -= size;
+		else
+			vm->stats[type].drm.private -= size;
+	}
+	if (new_mem) {
+		type = fold_memtype(new_mem->mem_type);
+		if (shared)
+			vm->stats[type].drm.shared += size;
+		else
+			vm->stats[type].drm.private += size;
+	}
+	if (bo->preferred_domains & AMDGPU_GEM_DOMAIN_VRAM) {
+		if (!old_mem)
+			vm->stats[TTM_PL_VRAM].requested += size;
+		else if (old_mem->mem_type != TTM_PL_VRAM)
+			vm->stats[TTM_PL_VRAM].evicted -= size;
+		if (!new_mem)
+			vm->stats[TTM_PL_VRAM].requested -= size;
+		else if (new_mem->mem_type != TTM_PL_VRAM)
+			vm->stats[TTM_PL_VRAM].evicted += size;
+	} else if (bo->preferred_domains & AMDGPU_GEM_DOMAIN_GTT) {
+		if (!old_mem)
+			vm->stats[TTM_PL_TT].requested += size;
+		if (!new_mem)
+			vm->stats[TTM_PL_TT].requested -= size;
+	}
+
+	spin_unlock(&vm->status_lock);
+}
+
 /**
  * amdgpu_vm_bo_base_init - Adds bo to the list of bos associated with the vm
  *
@@ -332,6 +461,7 @@ void amdgpu_vm_bo_base_init(struct amdgpu_vm_bo_base *base,
 		return;
 	base->next = bo->vm_bo;
 	bo->vm_bo = base;
+	amdgpu_vm_update_stats(base, bo->tbo.resource, NULL);
 
 	if (!amdgpu_vm_is_bo_always_valid(vm, bo))
 		return;
@@ -1106,29 +1236,10 @@ static void amdgpu_vm_bo_get_memory(struct amdgpu_bo_va *bo_va,
 }
 
 void amdgpu_vm_get_memory(struct amdgpu_vm *vm,
-			  struct amdgpu_mem_stats *stats,
-			  unsigned int size)
+			  struct amdgpu_mem_stats stats[__AMDGPU_PL_LAST])
 {
-	struct amdgpu_bo_va *bo_va, *tmp;
-
 	spin_lock(&vm->status_lock);
-	list_for_each_entry_safe(bo_va, tmp, &vm->idle, base.vm_status)
-		amdgpu_vm_bo_get_memory(bo_va, stats, size);
-
-	list_for_each_entry_safe(bo_va, tmp, &vm->evicted, base.vm_status)
-		amdgpu_vm_bo_get_memory(bo_va, stats, size);
-
-	list_for_each_entry_safe(bo_va, tmp, &vm->relocated, base.vm_status)
-		amdgpu_vm_bo_get_memory(bo_va, stats, size);
-
-	list_for_each_entry_safe(bo_va, tmp, &vm->moved, base.vm_status)
-		amdgpu_vm_bo_get_memory(bo_va, stats, size);
-
-	list_for_each_entry_safe(bo_va, tmp, &vm->invalidated, base.vm_status)
-		amdgpu_vm_bo_get_memory(bo_va, stats, size);
-
-	list_for_each_entry_safe(bo_va, tmp, &vm->done, base.vm_status)
-		amdgpu_vm_bo_get_memory(bo_va, stats, size);
+	memcpy(stats, vm->stats, sizeof(*stats) * __AMDGPU_PL_LAST);
 	spin_unlock(&vm->status_lock);
 }
 
@@ -2071,6 +2182,7 @@ void amdgpu_vm_bo_del(struct amdgpu_device *adev,
 			if (*base != &bo_va->base)
 				continue;
 
+			amdgpu_vm_update_stats(*base, NULL, bo->tbo.resource);
 			*base = bo_va->base.next;
 			break;
 		}
@@ -2136,6 +2248,22 @@ bool amdgpu_vm_evictable(struct amdgpu_bo *bo)
 	return true;
 }
 
+/**
+ * amdgpu_vm_bo_update_shared - called when bo gets shared/unshared
+ *
+ * @bo: amdgpu buffer object
+ * @sign: if we should add (+1) or subtract (-1) the memory stat
+ *
+ * Update the per VM stats for all the VMs this BO belongs to.
+ */
+void amdgpu_vm_bo_update_shared(struct amdgpu_bo *bo, int sign)
+{
+	struct amdgpu_vm_bo_base *bo_base;
+
+	for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next)
+		amdgpu_vm_update_shared(bo_base, sign);
+}
+
 /**
  * amdgpu_vm_bo_invalidate - mark the bo as invalid
  *
@@ -2169,6 +2297,26 @@ void amdgpu_vm_bo_invalidate(struct amdgpu_bo *bo, bool evicted)
 	}
 }
 
+/**
+ * amdgpu_vm_bo_move - handle BO move
+ *
+ * @bo: amdgpu buffer object
+ * @new_mem: the new placement of the BO move
+ * @evicted: is the BO evicted
+ *
+ * Update the memory stats for the new placement and mark @bo as invalid.
+ */
+void amdgpu_vm_bo_move(struct amdgpu_bo *bo, struct ttm_resource *new_mem,
+		       bool evicted)
+{
+	struct amdgpu_vm_bo_base *bo_base;
+
+	for (bo_base = bo->vm_bo; bo_base; bo_base = bo_base->next)
+		amdgpu_vm_update_stats(bo_base, new_mem, bo->tbo.resource);
+
+	amdgpu_vm_bo_invalidate(bo, evicted);
+}
+
 /**
  * amdgpu_vm_get_block_size - calculate VM page table size as power of two
  *
@@ -2585,6 +2733,18 @@ void amdgpu_vm_release_compute(struct amdgpu_device *adev, struct amdgpu_vm *vm)
 	vm->is_compute_context = false;
 }
 
+static int amdgpu_vm_stats_is_zero(struct amdgpu_vm *vm)
+{
+	for (int i = 0; i < __AMDGPU_PL_LAST; ++i) {
+		if (!drm_memory_stats_is_zero(&vm->stats[i].drm) ||
+		    vm->stats[i].evicted != 0 ||
+		    vm->stats[i].requested != 0)
+			return 0;
+	}
+
+	return 1;
+}
+
 /**
  * amdgpu_vm_fini - tear down a vm instance
  *
@@ -2656,6 +2816,8 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm)
 		}
 	}
 
+	if (!amdgpu_vm_stats_is_zero(vm))
+		dev_err(adev->dev, "VM memory stats are non-zero on fini\n");
 }
 
 /**
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index 6a1b344e15e1b..7b3cd6367969d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -24,6 +24,7 @@
 #ifndef __AMDGPU_VM_H__
 #define __AMDGPU_VM_H__
 
+#include "amdgpu_ttm.h"
 #include <linux/idr.h>
 #include <linux/kfifo.h>
 #include <linux/rbtree.h>
@@ -345,6 +346,9 @@ struct amdgpu_vm {
 	/* Lock to protect vm_bo add/del/move on all lists of vm */
 	spinlock_t status_lock;
 
+	/* Memory statistics for this vm, protected by the status_lock */
+	struct amdgpu_mem_stats stats[__AMDGPU_PL_LAST];
+
 	/* Per-VM and PT BOs who needs a validation */
 	struct list_head evicted;
 
@@ -525,6 +529,12 @@ int amdgpu_vm_bo_update(struct amdgpu_device *adev,
 			bool clear);
 bool amdgpu_vm_evictable(struct amdgpu_bo *bo);
 void amdgpu_vm_bo_invalidate(struct amdgpu_bo *bo, bool evicted);
+void amdgpu_vm_update_stats(struct amdgpu_vm_bo_base *base,
+			    struct ttm_resource *new_mem,
+			    struct ttm_resource *old_mem);
+void amdgpu_vm_bo_update_shared(struct amdgpu_bo *bo, int sign);
+void amdgpu_vm_bo_move(struct amdgpu_bo *bo, struct ttm_resource *new_mem,
+		       bool evicted);
 uint64_t amdgpu_vm_map_gart(const dma_addr_t *pages_addr, uint64_t addr);
 struct amdgpu_bo_va *amdgpu_vm_bo_find(struct amdgpu_vm *vm,
 				       struct amdgpu_bo *bo);
@@ -575,8 +585,7 @@ void amdgpu_vm_set_task_info(struct amdgpu_vm *vm);
 void amdgpu_vm_move_to_lru_tail(struct amdgpu_device *adev,
 				struct amdgpu_vm *vm);
 void amdgpu_vm_get_memory(struct amdgpu_vm *vm,
-			  struct amdgpu_mem_stats *stats,
-			  unsigned int size);
+			  struct amdgpu_mem_stats stats[__AMDGPU_PL_LAST]);
 int amdgpu_vm_pt_clear(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 		       struct amdgpu_bo_vm *vmbo, bool immediate);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
index f78a0434a48fa..bd57ced911e32 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm_pt.c
@@ -537,6 +537,7 @@ static void amdgpu_vm_pt_free(struct amdgpu_vm_bo_base *entry)
 	if (!entry->bo)
 		return;
 
+	amdgpu_vm_update_stats(entry, NULL, entry->bo->tbo.resource);
 	entry->bo->vm_bo = NULL;
 	ttm_bo_set_bulk_move(&entry->bo->tbo, NULL);
 
diff --git a/drivers/gpu/drm/drm_file.c b/drivers/gpu/drm/drm_file.c
index 714e42b051080..39e36fa1e89cd 100644
--- a/drivers/gpu/drm/drm_file.c
+++ b/drivers/gpu/drm/drm_file.c
@@ -859,6 +859,14 @@ static void print_size(struct drm_printer *p, const char *stat,
 	drm_printf(p, "drm-%s-%s:\t%llu%s\n", stat, region, sz, units[u]);
 }
 
+int drm_memory_stats_is_zero(const struct drm_memory_stats *stats)
+{
+	return (stats->shared == 0 &&
+		stats->private == 0 &&
+		stats->resident == 0 &&
+		stats->purgeable == 0 &&
+		stats->active == 0);
+}
+
 /**
  * drm_print_memory_stats - A helper to print memory stats
  * @p: The printer to print output to
diff --git a/include/drm/drm_file.h b/include/drm/drm_file.h
index ab230d3af138d..7f91e35d027d9 100644
--- a/include/drm/drm_file.h
+++ b/include/drm/drm_file.h
@@ -477,6 +477,7 @@ struct drm_memory_stats {
 
 enum drm_gem_object_status;
 
+int drm_memory_stats_is_zero(const struct drm_memory_stats *stats);
 void drm_print_memory_stats(struct drm_printer *p,
 			    const struct drm_memory_stats *stats,
 			    enum drm_gem_object_status supported_status,
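Once applied, the counters still surface through the standard DRM fdinfo
interface. An illustrative read (the fd number and all values are made
up, and the exact key set depends on the regions the driver reports):

	$ grep ^drm- /proc/`pidof some_app`/fdinfo/4
	drm-driver:	amdgpu
	drm-total-vram:	512 MiB
	drm-resident-vram:	256 MiB
	drm-total-gtt:	64 MiB
	drm-resident-gtt:	64 MiB
	drm-total-cpu:	0
	drm-resident-cpu:	0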