Message ID: 20220125223752.200211-7-andrey.grodzovsky@amd.com (mailing list archive)
State:      New, archived
Series:     Define and use reset domain for GPU recovery in amdgpu
On 1/26/2022 4:07 AM, Andrey Grodzovsky wrote:
> Since we serialize all resets no need to protect from concurrent
> resets.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> Reviewed-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +------------------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  1 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  1 -
>  3 files changed, 1 insertion(+), 20 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 258ec3c0b2af..107a393ebbfd 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5013,25 +5013,9 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>  	dev_info(adev->dev, "GPU %s begin!\n",
>  		need_emergency_restart ? "jobs stop":"reset");
>
> -	/*
> -	 * Here we trylock to avoid chain of resets executing from
> -	 * either trigger by jobs on different adevs in XGMI hive or jobs on
> -	 * different schedulers for same device while this TO handler is running.
> -	 * We always reset all schedulers for device and all devices for XGMI
> -	 * hive so that should take care of them too.
> -	 */
>  	hive = amdgpu_get_xgmi_hive(adev);
> -	if (hive) {
> -		if (atomic_cmpxchg(&hive->in_reset, 0, 1) != 0) {
> -			DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
> -				job ? job->base.id : -1, hive->hive_id);
> -			amdgpu_put_xgmi_hive(hive);
> -			if (job && job->vm)
> -				drm_sched_increase_karma(&job->base);
> -			return 0;
> -		}

This function in general will reset all devices in a hive. In a situation where GPU0 in hive0 gets to this function first and GPU1 in hive0 also hangs shortly after (before GPU0's recovery process starts resetting the other devices in the hive), we don't want to execute the work queued as part of GPU1's recovery as well. Both GPU0's and GPU1's recovery processes will try to reset all the devices in the hive.

In short - if a reset domain is already active, we probably don't need to queue another work item to the domain, since all devices in the domain are expected to get reset shortly.

Thanks,
Lijo

> +	if (hive)
>  		mutex_lock(&hive->hive_lock);
> -	}
>
>  	reset_context.method = AMD_RESET_METHOD_NONE;
>  	reset_context.reset_req_dev = adev;
> @@ -5227,7 +5211,6 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>
>  skip_recovery:
>  	if (hive) {
> -		atomic_set(&hive->in_reset, 0);
>  		mutex_unlock(&hive->hive_lock);
>  		amdgpu_put_xgmi_hive(hive);
>  	}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> index a858e3457c5c..9ad742039ac9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> @@ -404,7 +404,6 @@ struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev)
>  	INIT_LIST_HEAD(&hive->device_list);
>  	INIT_LIST_HEAD(&hive->node);
>  	mutex_init(&hive->hive_lock);
> -	atomic_set(&hive->in_reset, 0);
>  	atomic_set(&hive->number_devices, 0);
>  	task_barrier_init(&hive->tb);
>  	hive->pstate = AMDGPU_XGMI_PSTATE_UNKNOWN;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> index 6121aaa292cb..2f2ce53645a5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
> @@ -33,7 +33,6 @@ struct amdgpu_hive_info {
>  	struct list_head node;
>  	atomic_t number_devices;
>  	struct mutex hive_lock;
> -	atomic_t in_reset;
>  	int hi_req_count;
>  	struct amdgpu_device *hi_req_gpu;
>  	struct task_barrier tb;
On 2022-02-08 01:33, Lazar, Lijo wrote:
>
> On 1/26/2022 4:07 AM, Andrey Grodzovsky wrote:
>> Since we serialize all resets no need to protect from concurrent
>> resets.
>>
>> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>> Reviewed-by: Christian König <christian.koenig@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 19 +------------------
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c   |  1 -
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h   |  1 -
>>   3 files changed, 1 insertion(+), 20 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 258ec3c0b2af..107a393ebbfd 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -5013,25 +5013,9 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>>   	dev_info(adev->dev, "GPU %s begin!\n",
>>   		need_emergency_restart ? "jobs stop":"reset");
>>
>> -	/*
>> -	 * Here we trylock to avoid chain of resets executing from
>> -	 * either trigger by jobs on different adevs in XGMI hive or jobs on
>> -	 * different schedulers for same device while this TO handler is running.
>> -	 * We always reset all schedulers for device and all devices for XGMI
>> -	 * hive so that should take care of them too.
>> -	 */
>>   	hive = amdgpu_get_xgmi_hive(adev);
>> -	if (hive) {
>> -		if (atomic_cmpxchg(&hive->in_reset, 0, 1) != 0) {
>> -			DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
>> -				job ? job->base.id : -1, hive->hive_id);
>> -			amdgpu_put_xgmi_hive(hive);
>> -			if (job && job->vm)
>> -				drm_sched_increase_karma(&job->base);
>> -			return 0;
>> -		}
>
> This function in general will reset all devices in a hive.
>
> In a situation where GPU0 in hive0 gets to this function first and GPU1
> in hive0 also hangs shortly after (before GPU0's recovery process starts
> resetting the other devices in the hive), we don't want to execute the
> work queued as part of GPU1's recovery as well. Both GPU0's and GPU1's
> recovery processes will try to reset all the devices in the hive.
>
> In short - if a reset domain is already active, we probably don't need
> to queue another work item to the domain, since all devices in the
> domain are expected to get reset shortly.
>
> Thanks,
> Lijo

No worries for this - check this part of drm_sched_stop:

https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/scheduler/sched_main.c#L452

This will be called for each scheduler participating in the reset domain
(including this GPU's schedulers) and will cancel any such pending resets
that we want to avoid executing.

Andrey

>
>> +	if (hive)
>>   		mutex_lock(&hive->hive_lock);
>> -	}
>>
>>   	reset_context.method = AMD_RESET_METHOD_NONE;
>>   	reset_context.reset_req_dev = adev;
>> @@ -5227,7 +5211,6 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
>>
>>   skip_recovery:
>>   	if (hive) {
>> -		atomic_set(&hive->in_reset, 0);
>>   		mutex_unlock(&hive->hive_lock);
>>   		amdgpu_put_xgmi_hive(hive);
>>   	}
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>> index a858e3457c5c..9ad742039ac9 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
>> @@ -404,7 +404,6 @@ struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev)
>>   	INIT_LIST_HEAD(&hive->device_list);
>>   	INIT_LIST_HEAD(&hive->node);
>>   	mutex_init(&hive->hive_lock);
>> -	atomic_set(&hive->in_reset, 0);
>>   	atomic_set(&hive->number_devices, 0);
>>   	task_barrier_init(&hive->tb);
>>   	hive->pstate = AMDGPU_XGMI_PSTATE_UNKNOWN;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>> index 6121aaa292cb..2f2ce53645a5 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
>> @@ -33,7 +33,6 @@ struct amdgpu_hive_info {
>>   	struct list_head node;
>>   	atomic_t number_devices;
>>   	struct mutex hive_lock;
>> -	atomic_t in_reset;
>>   	int hi_req_count;
>>   	struct amdgpu_device *hi_req_gpu;
>>   	struct task_barrier tb;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 258ec3c0b2af..107a393ebbfd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5013,25 +5013,9 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
 	dev_info(adev->dev, "GPU %s begin!\n",
 		need_emergency_restart ? "jobs stop":"reset");
 
-	/*
-	 * Here we trylock to avoid chain of resets executing from
-	 * either trigger by jobs on different adevs in XGMI hive or jobs on
-	 * different schedulers for same device while this TO handler is running.
-	 * We always reset all schedulers for device and all devices for XGMI
-	 * hive so that should take care of them too.
-	 */
 	hive = amdgpu_get_xgmi_hive(adev);
-	if (hive) {
-		if (atomic_cmpxchg(&hive->in_reset, 0, 1) != 0) {
-			DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
-				job ? job->base.id : -1, hive->hive_id);
-			amdgpu_put_xgmi_hive(hive);
-			if (job && job->vm)
-				drm_sched_increase_karma(&job->base);
-			return 0;
-		}
+	if (hive)
 		mutex_lock(&hive->hive_lock);
-	}
 
 	reset_context.method = AMD_RESET_METHOD_NONE;
 	reset_context.reset_req_dev = adev;
@@ -5227,7 +5211,6 @@ int amdgpu_device_gpu_recover_imp(struct amdgpu_device *adev,
 
 skip_recovery:
 	if (hive) {
-		atomic_set(&hive->in_reset, 0);
 		mutex_unlock(&hive->hive_lock);
 		amdgpu_put_xgmi_hive(hive);
 	}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
index a858e3457c5c..9ad742039ac9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
@@ -404,7 +404,6 @@ struct amdgpu_hive_info *amdgpu_get_xgmi_hive(struct amdgpu_device *adev)
 	INIT_LIST_HEAD(&hive->device_list);
 	INIT_LIST_HEAD(&hive->node);
 	mutex_init(&hive->hive_lock);
-	atomic_set(&hive->in_reset, 0);
 	atomic_set(&hive->number_devices, 0);
 	task_barrier_init(&hive->tb);
 	hive->pstate = AMDGPU_XGMI_PSTATE_UNKNOWN;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
index 6121aaa292cb..2f2ce53645a5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h
@@ -33,7 +33,6 @@ struct amdgpu_hive_info {
 	struct list_head node;
 	atomic_t number_devices;
 	struct mutex hive_lock;
-	atomic_t in_reset;
 	int hi_req_count;
 	struct amdgpu_device *hi_req_gpu;
 	struct task_barrier tb;