Message ID | 20240307205054.3904657-3-sunil.khatri@amd.com (mailing list archive) |
---|---|
State | New, archived |
Series | Add pagefault support for devcoredump |
On 07.03.24 at 21:50, Sunil Khatri wrote:
> Add page fault information to the devcoredump.
>
> Output of devcoredump:
> **** AMDGPU Device Coredump ****
> version: 1
> kernel: 6.7.0-amd-staging-drm-next
> module: amdgpu
> time: 29.725011811
> process_name: soft_recovery_p PID: 1720
>
> Ring timed out details
> IP Type: 0 Ring Name: gfx_0.0.0
>
> [gfxhub] Page fault observed
> Faulty page starting at address: 0x0000000000000000
> Protection fault status register: 0x301031
>
> VRAM is lost due to GPU reset!
>
> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> index 147100c27c2d..8794a3c21176 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
>  				coredump->ring->name);
>  	}
>
> +	if (coredump->adev) {
> +		struct amdgpu_vm_fault_info *fault_info =
> +			&coredump->adev->vm_manager.fault_info;
> +
> +		drm_printf(&p, "\n[%s] Page fault observed\n",
> +			   fault_info->vmhub ? "mmhub" : "gfxhub");
> +		drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
> +			   fault_info->addr);
> +		drm_printf(&p, "Protection fault status register: 0x%x\n",
> +			   fault_info->status);
> +	}
> +
>  	if (coredump->reset_vram_lost)
> -		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
> +		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");

Why this additional new line?

Apart from that, this looks really good to me.

Regards,
Christian.

>  	if (coredump->adev->reset_info.num_regs) {
>  		drm_printf(&p, "AMDGPU register dumps:\nOffset: Value:\n");
>
On 3/8/2024 2:39 PM, Christian König wrote:
> On 07.03.24 at 21:50, Sunil Khatri wrote:
>> Add page fault information to the devcoredump.
>>
>> Output of devcoredump:
>> **** AMDGPU Device Coredump ****
>> version: 1
>> kernel: 6.7.0-amd-staging-drm-next
>> module: amdgpu
>> time: 29.725011811
>> process_name: soft_recovery_p PID: 1720
>>
>> Ring timed out details
>> IP Type: 0 Ring Name: gfx_0.0.0
>>
>> [gfxhub] Page fault observed
>> Faulty page starting at address: 0x0000000000000000
>> Protection fault status register: 0x301031
>>
>> VRAM is lost due to GPU reset!
>>
>> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
>>  1 file changed, 13 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> index 147100c27c2d..8794a3c21176 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t
>> offset, size_t count,
>>  				coredump->ring->name);
>>  	}
>> +	if (coredump->adev) {
>> +		struct amdgpu_vm_fault_info *fault_info =
>> +			&coredump->adev->vm_manager.fault_info;
>> +
>> +		drm_printf(&p, "\n[%s] Page fault observed\n",
>> +			   fault_info->vmhub ? "mmhub" : "gfxhub");
>> +		drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
>> +			   fault_info->addr);
>> +		drm_printf(&p, "Protection fault status register: 0x%x\n",
>> +			   fault_info->status);
>> +	}
>> +
>>  	if (coredump->reset_vram_lost)
>> -		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>> +		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
>
> Why this additional new line?

The intent is for the devcoredump to have its sections clearly demarcated with a new line; otherwise "VRAM is lost due to GPU reset!" looks like part of the page fault information.

[gfxhub] Page fault observed
Faulty page starting at address: 0x0000000000000000
Protection fault status register: 0x301031

VRAM is lost due to GPU reset!

Regards
Sunil

>
> Apart from that looks really good to me.
>
> Regards,
> Christian.
>
>>  	if (coredump->adev->reset_info.num_regs) {
>>  		drm_printf(&p, "AMDGPU register dumps:\nOffset:
>> Value:\n");
>
On 08.03.24 at 10:16, Khatri, Sunil wrote:
>
> On 3/8/2024 2:39 PM, Christian König wrote:
>> On 07.03.24 at 21:50, Sunil Khatri wrote:
>>> Add page fault information to the devcoredump.
>>>
>>> Output of devcoredump:
>>> **** AMDGPU Device Coredump ****
>>> version: 1
>>> kernel: 6.7.0-amd-staging-drm-next
>>> module: amdgpu
>>> time: 29.725011811
>>> process_name: soft_recovery_p PID: 1720
>>>
>>> Ring timed out details
>>> IP Type: 0 Ring Name: gfx_0.0.0
>>>
>>> [gfxhub] Page fault observed
>>> Faulty page starting at address: 0x0000000000000000
>>> Protection fault status register: 0x301031
>>>
>>> VRAM is lost due to GPU reset!
>>>
>>> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
>>>  1 file changed, 13 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> index 147100c27c2d..8794a3c21176 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>>> @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t
>>> offset, size_t count,
>>>  				coredump->ring->name);
>>>  	}
>>> +	if (coredump->adev) {
>>> +		struct amdgpu_vm_fault_info *fault_info =
>>> +			&coredump->adev->vm_manager.fault_info;
>>> +
>>> +		drm_printf(&p, "\n[%s] Page fault observed\n",
>>> +			   fault_info->vmhub ? "mmhub" : "gfxhub");
>>> +		drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
>>> +			   fault_info->addr);
>>> +		drm_printf(&p, "Protection fault status register: 0x%x\n",
>>> +			   fault_info->status);
>>> +	}
>>> +
>>>  	if (coredump->reset_vram_lost)
>>> -		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>>> +		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
>>
>> Why this additional new line?
> The intent is for the devcoredump to have its sections clearly
> demarcated with a new line; otherwise "VRAM is lost due to GPU reset!"
> looks like part of the page fault information.
> [gfxhub] Page fault observed
> Faulty page starting at address: 0x0000000000000000
> Protection fault status register: 0x301031
>
> VRAM is lost due to GPU reset!

In that case I would print the newline regardless of whether VRAM is lost or not. Otherwise you get:

Protection fault status register:...

VRAM is lost due to GPU reset!
AMDGPU register dumps:

in one case and:

Protection fault status register:...
AMDGPU register dumps:

in the other case, which breaks this sectioning quite a bit.

Regards,
Christian.

>
> Regards
> Sunil
>
>>
>> Apart from that looks really good to me.
>>
>> Regards,
>> Christian.
>>
>>>  	if (coredump->adev->reset_info.num_regs) {
>>>  		drm_printf(&p, "AMDGPU register dumps:\nOffset:
>>> Value:\n");
>>
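One possible reading of the suggestion above, sketched as a userspace program fragment rather than kernel code: the separating newline is emitted before the VRAM check, so the page fault section is always followed by a blank line. `struct sketch_printer`, `sketch_printf()` and `sketch_print_tail()` are hypothetical stand-ins for `struct drm_printer`, `drm_printf()` and the tail of `amdgpu_devcoredump_read()`; the status value is the one from the sample output, and this is not necessarily what a later revision of the patch does.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for struct drm_printer: appends to a fixed buffer. */
struct sketch_printer {
	char *buf;
	size_t size;
	size_t len;
};

/* Hypothetical stand-in for drm_printf(): formatted append, clamped on overflow. */
static void sketch_printf(struct sketch_printer *p, const char *fmt, ...)
{
	va_list args;
	int n;

	assert(p && p->buf && p->len < p->size);
	va_start(args, fmt);
	n = vsnprintf(p->buf + p->len, p->size - p->len, fmt, args);
	va_end(args);
	if (n < 0)
		return;
	if ((size_t)n >= p->size - p->len)
		p->len = p->size - 1;	/* output was truncated */
	else
		p->len += (size_t)n;
}

/* Hypothetical helper mirroring the end of the dump, with the separator
 * printed whether or not VRAM was lost. */
static void sketch_print_tail(struct sketch_printer *p, bool vram_lost,
			      unsigned int num_regs)
{
	sketch_printf(p, "Protection fault status register: 0x%x\n", 0x301031u);

	/* Section separator, independent of vram_lost. */
	sketch_printf(p, "\n");
	if (vram_lost)
		sketch_printf(p, "VRAM is lost due to GPU reset!\n");

	if (num_regs)
		sketch_printf(p, "AMDGPU register dumps:\nOffset: Value:\n");
}
```

Calling `sketch_print_tail()` once with `vram_lost` set and once without should give a blank line after the status register line in both outputs, which is the consistent sectioning asked for in the review.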
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 147100c27c2d..8794a3c21176 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
 				coredump->ring->name);
 	}
 
+	if (coredump->adev) {
+		struct amdgpu_vm_fault_info *fault_info =
+			&coredump->adev->vm_manager.fault_info;
+
+		drm_printf(&p, "\n[%s] Page fault observed\n",
+			   fault_info->vmhub ? "mmhub" : "gfxhub");
+		drm_printf(&p, "Faulty page starting at address: 0x%016llx\n",
+			   fault_info->addr);
+		drm_printf(&p, "Protection fault status register: 0x%x\n",
+			   fault_info->status);
+	}
+
 	if (coredump->reset_vram_lost)
-		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
+		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
 	if (coredump->adev->reset_info.num_regs) {
 		drm_printf(&p, "AMDGPU register dumps:\nOffset: Value:\n");
Add page fault information to the devcoredump.

Output of devcoredump:
**** AMDGPU Device Coredump ****
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 29.725011811
process_name: soft_recovery_p PID: 1720

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

[gfxhub] Page fault observed
Faulty page starting at address: 0x0000000000000000
Protection fault status register: 0x301031

VRAM is lost due to GPU reset!

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)
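
To show how the three format strings in the patch produce the page fault lines in the sample output, here is a minimal userspace sketch. `struct sketch_fault_info` and `sketch_format_fault()` are hypothetical names; only the field names `addr`, `status` and `vmhub` come from the patch, and the field types chosen here are an assumption, not the kernel's definition of `struct amdgpu_vm_fault_info`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, trimmed-down mirror of the fields the patch reads.
 * The types are assumptions made for this sketch. */
struct sketch_fault_info {
	uint64_t addr;		/* faulting page address */
	uint32_t status;	/* protection fault status register value */
	unsigned int vmhub;	/* zero selects "gfxhub", non-zero "mmhub", as in the patch */
};

/* Formats the page fault section with the same format strings as the patch.
 * Returns the snprintf() result (length that would have been written). */
static int sketch_format_fault(char *buf, size_t size,
			       const struct sketch_fault_info *fi)
{
	assert(buf && size > 0 && fi);
	return snprintf(buf, size,
			"\n[%s] Page fault observed\n"
			"Faulty page starting at address: 0x%016llx\n"
			"Protection fault status register: 0x%x\n",
			fi->vmhub ? "mmhub" : "gfxhub",
			(unsigned long long)fi->addr,
			(unsigned int)fi->status);
}
```

With `addr` set to 0, `status` set to 0x301031 and `vmhub` set to 0, the formatted text matches the `[gfxhub]` block shown in the commit message: `%016llx` zero-pads the address to 16 hex digits, while `%x` prints the status without padding.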