From patchwork Thu Mar 7 20:50:53 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sunil Khatri X-Patchwork-Id: 13586263 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 0712DC54798 for ; Thu, 7 Mar 2024 20:51:07 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id CCA9410F56D; Thu, 7 Mar 2024 20:51:05 +0000 (UTC) Received: from rtg-sunil-navi33.amd.com (unknown [165.204.156.251]) by gabe.freedesktop.org (Postfix) with ESMTPS id 310F410F568; Thu, 7 Mar 2024 20:51:02 +0000 (UTC) Received: from rtg-sunil-navi33.amd.com (localhost [127.0.0.1]) by rtg-sunil-navi33.amd.com (8.15.2/8.15.2/Debian-22ubuntu3) with ESMTP id 427KoviT3904698; Fri, 8 Mar 2024 02:20:57 +0530 Received: (from sunil@localhost) by rtg-sunil-navi33.amd.com (8.15.2/8.15.2/Submit) id 427KovQR3904697; Fri, 8 Mar 2024 02:20:57 +0530 From: Sunil Khatri To: Alex Deucher , =?utf-8?q?Christian_K=C3=B6nig?= , Shashank Sharma Cc: amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org, Mukul Joshi , Arunpravin Paneer Selvam , Sunil Khatri Subject: [PATCH v2 1/2] drm/amdgpu: add recent pagefault info in vm_manager Date: Fri, 8 Mar 2024 02:20:53 +0530 Message-Id: <20240307205054.3904657-2-sunil.khatri@amd.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240307205054.3904657-1-sunil.khatri@amd.com> References: <20240307205054.3904657-1-sunil.khatri@amd.com> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" Currently page fault information is stored per vm and which could be freed or stale during reset. Add it pagefault information in the vm_manager which is a global space for vm's and remains valid across. Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 8 ++++++++ drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 2 ++ 2 files changed, 10 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 4299ce386322..81fb3465e197 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -2924,6 +2924,14 @@ void amdgpu_vm_update_fault_cache(struct amdgpu_device *adev, if (vm && status) { vm->fault_info.addr = addr; vm->fault_info.status = status; + /* + * Update the fault information globally for later usage + * when vm could be stale or freed. + */ + adev->vm_manager.fault_info.addr = addr; + adev->vm_manager.fault_info.vmhub = vmhub; + adev->vm_manager.fault_info.status = status; + if (AMDGPU_IS_GFXHUB(vmhub)) { vm->fault_info.vmhub = AMDGPU_VMHUB_TYPE_GFX; vm->fault_info.vmhub |= diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h index 047ec1930d12..8efa8422f4f7 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h @@ -422,6 +422,8 @@ struct amdgpu_vm_manager { * look up VM of a page fault */ struct xarray pasids; + /* Global registration of recent page fault information */ + struct amdgpu_vm_fault_info fault_info; }; struct amdgpu_bo_va_mapping; From patchwork Thu Mar 7 20:50:54 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Sunil Khatri X-Patchwork-Id: 13586265 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 48D00C54798 for ; Thu, 7 Mar 2024 20:51:13 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 1BC9F10F7C7; Thu, 7 Mar 2024 20:51:12 +0000 (UTC) Received: from rtg-sunil-navi33.amd.com (unknown [165.204.156.251]) by gabe.freedesktop.org (Postfix) with ESMTPS id 3968510F4F8; Thu, 7 Mar 2024 20:51:03 +0000 (UTC) Received: from rtg-sunil-navi33.amd.com (localhost [127.0.0.1]) by rtg-sunil-navi33.amd.com (8.15.2/8.15.2/Debian-22ubuntu3) with ESMTP id 427KowkS3904709; Fri, 8 Mar 2024 02:20:58 +0530 Received: (from sunil@localhost) by rtg-sunil-navi33.amd.com (8.15.2/8.15.2/Submit) id 427KowqA3904708; Fri, 8 Mar 2024 02:20:58 +0530 From: Sunil Khatri To: Alex Deucher , =?utf-8?q?Christian_K=C3=B6nig?= , Shashank Sharma Cc: amd-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org, linux-kernel@vger.kernel.org, Mukul Joshi , Arunpravin Paneer Selvam , Sunil Khatri Subject: [PATCH v2 2/2] drm/amdgpu: add vm fault information to devcoredump Date: Fri, 8 Mar 2024 02:20:54 +0530 Message-Id: <20240307205054.3904657-3-sunil.khatri@amd.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20240307205054.3904657-1-sunil.khatri@amd.com> References: <20240307205054.3904657-1-sunil.khatri@amd.com> MIME-Version: 1.0 X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" Add page fault information to the devcoredump. Output of devcoredump: **** AMDGPU Device Coredump **** version: 1 kernel: 6.7.0-amd-staging-drm-next module: amdgpu time: 29.725011811 process_name: soft_recovery_p PID: 1720 Ring timed out details IP Type: 0 Ring Name: gfx_0.0.0 [gfxhub] Page fault observed Faulty page starting at address: 0x0000000000000000 Protection fault status register: 0x301031 VRAM is lost due to GPU reset! Signed-off-by: Sunil Khatri --- drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c index 147100c27c2d..8794a3c21176 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, coredump->ring->name); } + if (coredump->adev) { + struct amdgpu_vm_fault_info *fault_info = + &coredump->adev->vm_manager.fault_info; + + drm_printf(&p, "\n[%s] Page fault observed\n", + fault_info->vmhub ? "mmhub" : "gfxhub"); + drm_printf(&p, "Faulty page starting at address: 0x%016llx\n", + fault_info->addr); + drm_printf(&p, "Protection fault status register: 0x%x\n", + fault_info->status); + } + if (coredump->reset_vram_lost) - drm_printf(&p, "VRAM is lost due to GPU reset!\n"); + drm_printf(&p, "\nVRAM is lost due to GPU reset!\n"); if (coredump->adev->reset_info.num_regs) { drm_printf(&p, "AMDGPU register dumps:\nOffset: Value:\n");