Message ID | 20240307165932.3856952-3-sunil.khatri@amd.com (mailing list archive) |
---|---|
State | New, archived |
Series | Add pagefault support for devcoredump |
On Thu, Mar 7, 2024 at 12:00 PM Sunil Khatri <sunil.khatri@amd.com> wrote:
>
> Add page fault information to the devcoredump.
>
> Output of devcoredump:
> **** AMDGPU Device Coredump ****
> version: 1
> kernel: 6.7.0-amd-staging-drm-next
> module: amdgpu
> time: 29.725011811
> process_name: soft_recovery_p PID: 1720
>
> Ring timed out details
> IP Type: 0 Ring Name: gfx_0.0.0
>
> [gfxhub] Page fault observed
> Faulty page starting at address 0x0000000000000000

Do you want a : before the address for consistency?

> Protection fault status register:0x301031

How about a space after the : for consistency?

For parsability, it may make more sense to just have a list of key value pairs:

[GPU page fault]
hub:
addr:
status:
[Ring timeout details]
IP:
ring:
name:

etc.

>
> VRAM is lost due to GPU reset!
>
> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> index 147100c27c2d..dd39e614d907 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
>  			   coredump->ring->name);
>  	}
>
> +	if (coredump->adev) {
> +		struct amdgpu_vm_fault_info *fault_info =
> +			&coredump->adev->vm_manager.fault_info;
> +
> +		drm_printf(&p, "\n[%s] Page fault observed\n",
> +			   fault_info->vmhub ? "mmhub" : "gfxhub");
> +		drm_printf(&p, "Faulty page starting at address 0x%016llx\n",
> +			   fault_info->addr);
> +		drm_printf(&p, "Protection fault status register:0x%x\n",
> +			   fault_info->status);
> +	}
> +
>  	if (coredump->reset_vram_lost)
> -		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
> +		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
>  	if (coredump->adev->reset_info.num_regs) {
>  		drm_printf(&p, "AMDGPU register dumps:\nOffset: Value:\n");
>
> --
> 2.34.1
>
On 3/8/2024 12:44 AM, Alex Deucher wrote:
> On Thu, Mar 7, 2024 at 12:00 PM Sunil Khatri <sunil.khatri@amd.com> wrote:
>> Add page fault information to the devcoredump.
>>
>> Output of devcoredump:
>> **** AMDGPU Device Coredump ****
>> version: 1
>> kernel: 6.7.0-amd-staging-drm-next
>> module: amdgpu
>> time: 29.725011811
>> process_name: soft_recovery_p PID: 1720
>>
>> Ring timed out details
>> IP Type: 0 Ring Name: gfx_0.0.0
>>
>> [gfxhub] Page fault observed
>> Faulty page starting at address 0x0000000000000000
> Do you want a : before the address for consistency?

sure.

>
>> Protection fault status register:0x301031
> How about a space after the : for consistency?
>
> For parsability, it may make more sense to just have a list of key value pairs:
> [GPU page fault]
> hub:
> addr:
> status:
> [Ring timeout details]
> IP:
> ring:
> name:
>
> etc.

Sure, I agree, but till now I was capturing information like we shared in
dmesg, which is user readable. But surely once we have enough data I could
arrange it all in key: value pairs like you suggest in a later patch, if
that works?

>
>> VRAM is lost due to GPU reset!
>>
>> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
>>  1 file changed, 13 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> index 147100c27c2d..dd39e614d907 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
>> @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
>>  			   coredump->ring->name);
>>  	}
>>
>> +	if (coredump->adev) {
>> +		struct amdgpu_vm_fault_info *fault_info =
>> +			&coredump->adev->vm_manager.fault_info;
>> +
>> +		drm_printf(&p, "\n[%s] Page fault observed\n",
>> +			   fault_info->vmhub ? "mmhub" : "gfxhub");
>> +		drm_printf(&p, "Faulty page starting at address 0x%016llx\n",
>> +			   fault_info->addr);
>> +		drm_printf(&p, "Protection fault status register:0x%x\n",
>> +			   fault_info->status);
>> +	}
>> +
>>  	if (coredump->reset_vram_lost)
>> -		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
>> +		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
>>  	if (coredump->adev->reset_info.num_regs) {
>>  		drm_printf(&p, "AMDGPU register dumps:\nOffset: Value:\n");
>>
>> --
>> 2.34.1
>>
On Thu, Mar 7, 2024 at 3:42 PM Khatri, Sunil <sukhatri@amd.com> wrote:
>
>
> On 3/8/2024 12:44 AM, Alex Deucher wrote:
> > On Thu, Mar 7, 2024 at 12:00 PM Sunil Khatri <sunil.khatri@amd.com> wrote:
> >> Add page fault information to the devcoredump.
> >>
> >> Output of devcoredump:
> >> **** AMDGPU Device Coredump ****
> >> version: 1
> >> kernel: 6.7.0-amd-staging-drm-next
> >> module: amdgpu
> >> time: 29.725011811
> >> process_name: soft_recovery_p PID: 1720
> >>
> >> Ring timed out details
> >> IP Type: 0 Ring Name: gfx_0.0.0
> >>
> >> [gfxhub] Page fault observed
> >> Faulty page starting at address 0x0000000000000000
> > Do you want a : before the address for consistency?
> sure.
> >
> >> Protection fault status register:0x301031
> > How about a space after the : for consistency?
> >
> > For parsability, it may make more sense to just have a list of key value pairs:
> > [GPU page fault]
> > hub:
> > addr:
> > status:
> > [Ring timeout details]
> > IP:
> > ring:
> > name:
> >
> > etc.
>
> Sure, I agree, but till now I was capturing information like we shared in
> dmesg, which is user readable. But surely once we have enough data I could
> arrange it all in key: value pairs like you suggest in a later patch, if
> that works?

Sure.

Alex

>
> >
> >> VRAM is lost due to GPU reset!
> >>
> >> Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
> >> ---
> >>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
> >>  1 file changed, 13 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> >> index 147100c27c2d..dd39e614d907 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
> >> @@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
> >>  			   coredump->ring->name);
> >>  	}
> >>
> >> +	if (coredump->adev) {
> >> +		struct amdgpu_vm_fault_info *fault_info =
> >> +			&coredump->adev->vm_manager.fault_info;
> >> +
> >> +		drm_printf(&p, "\n[%s] Page fault observed\n",
> >> +			   fault_info->vmhub ? "mmhub" : "gfxhub");
> >> +		drm_printf(&p, "Faulty page starting at address 0x%016llx\n",
> >> +			   fault_info->addr);
> >> +		drm_printf(&p, "Protection fault status register:0x%x\n",
> >> +			   fault_info->status);
> >> +	}
> >> +
> >>  	if (coredump->reset_vram_lost)
> >> -		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
> >> +		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
> >>  	if (coredump->adev->reset_info.num_regs) {
> >>  		drm_printf(&p, "AMDGPU register dumps:\nOffset: Value:\n");
> >>
> >> --
> >> 2.34.1
> >>
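For illustration only, the key: value layout suggested in the review could look roughly like the user-space sketch below. It is not part of the posted patch and not driver code: `struct fault_info_sketch` and `format_fault_kv()` are hypothetical names, `snprintf()` stands in for `drm_printf()`, and the section header and key names are taken from the suggestion in the thread (hub, addr, status).

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical user-space stand-in for the driver's fault record.
 * The field names mirror those used in the patch (vmhub, addr, status),
 * but this struct is an assumption made for the example.
 */
struct fault_info_sketch {
	unsigned int vmhub;	/* non-zero prints "mmhub", zero prints "gfxhub" */
	uint64_t addr;		/* faulting page address */
	uint32_t status;	/* protection fault status register value */
};

/*
 * Format the page fault section as one "key: value" pair per line.
 * Returns what snprintf() returns: the number of characters that
 * would have been written, excluding the terminating NUL.
 */
static int format_fault_kv(char *buf, size_t len,
			   const struct fault_info_sketch *f)
{
	return snprintf(buf, len,
			"[GPU page fault]\n"
			"hub: %s\n"
			"addr: 0x%016llx\n"
			"status: 0x%x\n",
			f->vmhub ? "mmhub" : "gfxhub",
			(unsigned long long)f->addr,
			(unsigned int)f->status);
}
```

With one pair per line and a single ": " separator, a tool reading the dump can split each line at the first ": " instead of matching free-form sentences such as "Faulty page starting at address".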
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
index 147100c27c2d..dd39e614d907 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c
@@ -203,8 +203,20 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count,
 			   coredump->ring->name);
 	}
 
+	if (coredump->adev) {
+		struct amdgpu_vm_fault_info *fault_info =
+			&coredump->adev->vm_manager.fault_info;
+
+		drm_printf(&p, "\n[%s] Page fault observed\n",
+			   fault_info->vmhub ? "mmhub" : "gfxhub");
+		drm_printf(&p, "Faulty page starting at address 0x%016llx\n",
+			   fault_info->addr);
+		drm_printf(&p, "Protection fault status register:0x%x\n",
+			   fault_info->status);
+	}
+
 	if (coredump->reset_vram_lost)
-		drm_printf(&p, "VRAM is lost due to GPU reset!\n");
+		drm_printf(&p, "\nVRAM is lost due to GPU reset!\n");
 	if (coredump->adev->reset_info.num_regs) {
 		drm_printf(&p, "AMDGPU register dumps:\nOffset: Value:\n");
Add page fault information to the devcoredump.

Output of devcoredump:
**** AMDGPU Device Coredump ****
version: 1
kernel: 6.7.0-amd-staging-drm-next
module: amdgpu
time: 29.725011811
process_name: soft_recovery_p PID: 1720

Ring timed out details
IP Type: 0 Ring Name: gfx_0.0.0

[gfxhub] Page fault observed
Faulty page starting at address 0x0000000000000000
Protection fault status register:0x301031

VRAM is lost due to GPU reset!

Signed-off-by: Sunil Khatri <sunil.khatri@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)