Message ID | 20241212190909.28559-2-andrealmeid@igalia.com (mailing list archive)
---|---
State | New
Series | drm/amdgpu: Use device wedged event
Hi Christian,

On 13/12/2024 04:34, Christian König wrote:
> On 12.12.24 at 20:09, André Almeida wrote:
>> Use DRM's device wedged event to notify userspace that a reset had
>> happened. For now, only use `none` method meant for telemetry
>> capture.
>>
>> Signed-off-by: André Almeida <andrealmeid@igalia.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 96316111300a..19e1a5493778 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>  	dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
>>  	atomic_set(&adev->reset_domain->reset_res, r);
>> +
>> +	drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
>
> That looks really good in general. I would just make the
> DRM_WEDGE_RECOVERY_NONE depend on the value of "r".

Why depend on `r`? A reset was triggered anyway, regardless of whether it
succeeded; shouldn't we tell userspace either way?
On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
> Hi Christian,
>
> On 13/12/2024 04:34, Christian König wrote:
> > On 12.12.24 at 20:09, André Almeida wrote:
[...]
> > That looks really good in general. I would just make the
> > DRM_WEDGE_RECOVERY_NONE depend on the value of "r".
>
> Why depend on `r`? A reset was triggered anyway, regardless of whether it
> succeeded; shouldn't we tell userspace either way?

A failed reset would perhaps result in wedging; at least that's how i915
is handling it.

Raag
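For reference, keying the event off `r` could look roughly like the fragment
below inside amdgpu_device_gpu_recover(). This is only an illustration of the
suggestion, not the submitted patch, and it assumes the wedged-event series
also defines a stronger DRM_WEDGE_RECOVERY_REBIND hint:

    	dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
    	atomic_set(&adev->reset_domain->reset_res, r);

    	/*
    	 * Illustrative variant: pick the recovery hint based on whether
    	 * the reset itself succeeded.
    	 */
    	if (!r)
    		/* Reset worked: purely informational, telemetry capture only. */
    		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
    	else
    		/* Reset failed: suggest a heavier recovery to userspace. */
    		drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_REBIND);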
On 13/12/2024 11:36, Raag Jadav wrote:
> On Fri, Dec 13, 2024 at 11:15:31AM -0300, André Almeida wrote:
[...]
>> Why depend on `r`? A reset was triggered anyway, regardless of whether it
>> succeeded; shouldn't we tell userspace either way?
>
> A failed reset would perhaps result in wedging; at least that's how i915
> is handling it.

Right, and I think this raises the question of which wedge recovery
method I should add for amdgpu... Christian?
On 13.12.24 at 16:56, André Almeida wrote:
> On 13/12/2024 11:36, Raag Jadav wrote:
[...]
>> A failed reset would perhaps result in wedging; at least that's how i915
>> is handling it.
>
> Right, and I think this raises the question of which wedge recovery
> method I should add for amdgpu... Christian?

In theory a rebind should be enough to get the device going again; our
BOCO does a bus reset on driver load anyway.

Regards,
Christian.
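To make the rebind path concrete, a userspace helper reacting to such an
event could unbind and re-bind the device through sysfs. The sketch below is
an assumption-laden example: the PCI address is made up, and real tooling
would discover it from the uevent rather than hard-code it.

    /* Hypothetical helper: rebind amdgpu via sysfs (illustrative only). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int write_str(const char *path, const char *val)
    {
            int fd = open(path, O_WRONLY);

            if (fd < 0)
                    return -1;
            if (write(fd, val, strlen(val)) < 0) {
                    close(fd);
                    return -1;
            }
            return close(fd);
    }

    int main(void)
    {
            const char *bdf = "0000:03:00.0"; /* made-up PCI address */

            if (write_str("/sys/bus/pci/drivers/amdgpu/unbind", bdf))
                    perror("unbind");
            if (write_str("/sys/bus/pci/drivers/amdgpu/bind", bdf))
                    perror("bind");
            return 0;
    }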
On 12/16/2024 3:48 PM, Christian König wrote:
> On 13.12.24 at 16:56, André Almeida wrote:
[...]
>> Right, and I think this raises the question of which wedge recovery
>> method I should add for amdgpu... Christian?
>
> In theory a rebind should be enough to get the device going again; our
> BOCO does a bus reset on driver load anyway.

The behavior varies between SOCs. On certain ones, if the driver reset
fails, the device is really in a bad state and needs a system reboot.

I had asked earlier about the utility of this one here. If this is just
to inform userspace that the driver has done a reset and recovered, it
would need some additional context as well. We have a mechanism in KFD
which sends the context in which a reset has to be done. Currently that's
restricted to compute applications, but if this is in a similar line, we
would like to pass some additional info like job timeout, RAS error etc.

Thanks,
Lijo
On 16/12/2024 07:38, Lazar, Lijo wrote:
> On 12/16/2024 3:48 PM, Christian König wrote:
[...]
>> In theory a rebind should be enough to get the device going again; our
>> BOCO does a bus reset on driver load anyway.
>
> The behavior varies between SOCs. On certain ones, if the driver reset
> fails, the device is really in a bad state and needs a system reboot.

Is this documented somewhere? Then I could even add a
DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.

> I had asked earlier about the utility of this one here. If this is just
> to inform userspace that the driver has done a reset and recovered, it
> would need some additional context as well. We have a mechanism in KFD
> which sends the context in which a reset has to be done. Currently that's
> restricted to compute applications, but if this is in a similar line, we
> would like to pass some additional info like job timeout, RAS error etc.

DRM_WEDGE_RECOVERY_NONE is to inform userspace that the driver has done a
reset and recovered. Additional data, such as which job timed out or which
RAS error fired, belongs in devcoredump, I guess, where all of that is
gathered and collected later.
On 16.12.24 at 14:04, André Almeida wrote:
> On 16/12/2024 07:38, Lazar, Lijo wrote:
[...]
>> The behavior varies between SOCs. On certain ones, if the driver reset
>> fails, the device is really in a bad state and needs a system reboot.
>
> Is this documented somewhere? Then I could even add a
> DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.

Not publicly, as far as I know. But indeed a driver reset has basically
the same chance of succeeding as a driver reload.

I think the use case we have here is more that the administrator
intentionally disabled the reset to allow HW investigation. So far we did
that with a rather broken "we don't do anything at all" approach.

> DRM_WEDGE_RECOVERY_NONE is to inform userspace that the driver has done a
> reset and recovered. Additional data, such as which job timed out or which
> RAS error fired, belongs in devcoredump, I guess, where all of that is
> gathered and collected later.

I think somebody else mentioned it as well that the source of the issue,
e.g. the PID of the submitting process, would be helpful for supervising
daemons which need to restart processes when they caused some issue. We
just postponed adding that till later.

Regards,
Christian.
On 16/12/2024 10:10, Christian König wrote:
> On 16.12.24 at 14:04, André Almeida wrote:
[...]
>> Is this documented somewhere? Then I could even add a
>> DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.
>
> Not publicly, as far as I know. But indeed a driver reset has basically
> the same chance of succeeding as a driver reload.
>
> I think the use case we have here is more that the administrator
> intentionally disabled the reset to allow HW investigation. So far we did
> that with a rather broken "we don't do anything at all" approach.

OK.

>> DRM_WEDGE_RECOVERY_NONE is to inform userspace that the driver has done a
>> reset and recovered. Additional data, such as which job timed out or which
>> RAS error fired, belongs in devcoredump, I guess, where all of that is
>> gathered and collected later.
>
> I think somebody else mentioned it as well that the source of the issue,
> e.g. the PID of the submitting process, would be helpful for supervising
> daemons which need to restart processes when they caused some issue.

It was me :) We have a use case where the daemon does indeed need the PID,
but the daemon doesn't need to know the RAS error or which job timed out;
there's no immediate action to be taken with that information, contrary to
the PID, which we need to know.

> We just postponed adding that till later.
On 16.12.24 at 14:15, André Almeida wrote:
> On 16/12/2024 10:10, Christian König wrote:
[...]
>> I think somebody else mentioned it as well that the source of the issue,
>> e.g. the PID of the submitting process, would be helpful for supervising
>> daemons which need to restart processes when they caused some issue.
>
> It was me :) We have a use case where the daemon does indeed need the PID,
> but the daemon doesn't need to know the RAS error or which job timed out;
> there's no immediate action to be taken with that information, contrary to
> the PID, which we need to know.

Yeah, that's all stuff for the device core dump, I think. But if you want
to add the PID for high-level control, then that would make sense, I think.

Regards,
Christian.
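For illustration, a supervising daemon of the kind discussed here could pick
up the wedged uevent with libudev and inspect the WEDGED property (the key
name follows the WEDGED=... form mentioned later in this thread). A minimal
sketch; no PID property is assumed, since that addition is still postponed:

    /* Minimal sketch of a daemon consuming DRM wedged uevents via libudev. */
    #include <libudev.h>
    #include <poll.h>
    #include <stdio.h>

    int main(void)
    {
            struct udev *udev = udev_new();
            struct udev_monitor *mon = udev_monitor_new_from_netlink(udev, "udev");
            struct pollfd pfd;

            udev_monitor_filter_add_match_subsystem_devtype(mon, "drm", NULL);
            udev_monitor_enable_receiving(mon);
            pfd.fd = udev_monitor_get_fd(mon);
            pfd.events = POLLIN;

            for (;;) {
                    struct udev_device *dev;
                    const char *wedged;

                    if (poll(&pfd, 1, -1) <= 0)
                            continue;
                    dev = udev_monitor_receive_device(mon);
                    if (!dev)
                            continue;
                    wedged = udev_device_get_property_value(dev, "WEDGED");
                    if (wedged)
                            /* Decide here whether to restart clients, rebind, etc. */
                            printf("%s wedged, recovery hint: %s\n",
                                   udev_device_get_sysname(dev), wedged);
                    udev_device_unref(dev);
            }
    }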
On 12/16/2024 6:45 PM, André Almeida wrote:
> On 16/12/2024 10:10, Christian König wrote:
[...]
>> I think somebody else mentioned it as well that the source of the issue,
>> e.g. the PID of the submitting process, would be helpful for supervising
>> daemons which need to restart processes when they caused some issue.
>
> It was me :) We have a use case where the daemon does indeed need the PID,
> but the daemon doesn't need to know the RAS error or which job timed out;
> there's no immediate action to be taken with that information, contrary to
> the PID, which we need to know.

Regarding devcoredump - it's not done every time. For example, RAS errors
have a different way of identifying the source of the error, hence we
don't need a coredump in such cases.

The intention is only to let the user know the reason for the reset at a
high level, and probably to add more things later, like the engines or
queues that were reset.

Thanks,
Lijo
On 16.12.24 at 14:36, Lazar, Lijo wrote:
[...]
> Regarding devcoredump - it's not done every time. For example, RAS errors
> have a different way of identifying the source of the error, hence we
> don't need a coredump in such cases.
>
> The intention is only to let the user know the reason for the reset at a
> high level, and probably to add more things later, like the engines or
> queues that were reset.

Well, what is the use case for that? That doesn't look valuable to me.

RAS errors should generally be reported to the application which issued
the submission. As a system-wide event they are only useful in things like
logfiles, I think.

Regards,
Christian.
On 12/16/2024 7:09 PM, Christian König wrote:
> On 16.12.24 at 14:36, Lazar, Lijo wrote:
[...]
>> The intention is only to let the user know the reason for the reset at a
>> high level, and probably to add more things later, like the engines or
>> queues that were reset.
>
> Well, what is the use case for that? That doesn't look valuable to me.

It's mostly for in-band telemetry reporting through tools like amd-smi -
more for admin purposes than for debugging.

Thanks,
Lijo

> RAS errors should generally be reported to the application which issued
> the submission. As a system-wide event they are only useful in things
> like logfiles, I think.
On Mon, Dec 16, 2024 at 10:15:00AM -0300, André Almeida wrote:
> On 16/12/2024 10:10, Christian König wrote:
> > On 16.12.24 at 14:04, André Almeida wrote:
[...]
> > > > The behavior varies between SOCs. On certain ones, if the driver
> > > > reset fails, the device is really in a bad state and needs a
> > > > system reboot.
> > >
> > > Is this documented somewhere? Then I could even add a
> > > DRM_WEDGE_RECOVERY_REBOOT so we can cover every scenario.

This was present in drafts v5 through v7 but later dropped with the
understanding that it is unwise to let a drm device make system-level
decisions, and to rather have something like WEDGED=unknown to let the
admin/user decide how to deal with it.

https://patchwork.freedesktop.org/series/138069/

> > I think somebody else mentioned it as well that the source of the
> > issue, e.g. the PID of the submitting process, would be helpful for
> > supervising daemons which need to restart processes when they caused
> > some issue.
>
> It was me :) We have a use case where the daemon does indeed need the
> PID, but the daemon doesn't need to know the RAS error or which job
> timed out; there's no immediate action to be taken with that
> information, contrary to the PID, which we need to know.

I think this calls for the standardization of telemetry (devcoredump,
syslog etc.), but since each driver has its own way of doing it, it'd be
quite an uphill battle.

Raag
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 96316111300a..19e1a5493778 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -6057,6 +6057,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	dev_info(adev->dev, "GPU reset end with ret = %d\n", r);
 	atomic_set(&adev->reset_domain->reset_res, r);
+
+	drm_dev_wedged_event(adev_to_drm(adev), DRM_WEDGE_RECOVERY_NONE);
+
 	return r;
 }
Use DRM's device wedged event to notify userspace that a reset had
happened. For now, only use `none` method meant for telemetry capture.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
 1 file changed, 3 insertions(+)