Message ID | 20220125223752.200211-1-andrey.grodzovsky@amd.com (mailing list archive)
---|---
Series | Define and use reset domain for GPU recovery in amdgpu
Just a gentle ping: do people have more comments on this patch set? Especially on the last 5 patches, as the first 7 are exactly the same as V2 and we already went over them mostly.

Andrey

On 2022-01-25 17:37, Andrey Grodzovsky wrote:
> This patchset is based on earlier work by Boris [1] that allowed having an
> ordered workqueue at the driver level, used by the different schedulers to
> queue their timeout work. On top of that, I also serialized any GPU reset
> we trigger from within amdgpu code to go through the same ordered wq,
> which somewhat simplifies our GPU reset code since we no longer need to
> protect against concurrent GPU reset triggers such as TDR on one hand and
> a sysfs or RAS trigger on the other. (A minimal sketch of this ordered-wq
> serialization follows the diffstat below.)
>
> As advised by Christian and Daniel, I defined a reset_domain struct such
> that all the entities that go through reset together are serialized one
> against another.
>
> A TDR triggered by multiple entities within the same domain for the same
> reason will not run multiple times, as the first such reset cancels all
> the pending ones. (See the cancellation sketch below as well.) This is
> relevant only to TDR timers and not to triggered resets coming from RAS
> or sysfs; those will still happen after the in-flight resets finish.
>
> v2:
> Add handling for the SRIOV configuration: the reset notification coming
> from the host and the driver already triggers a work queue to handle the
> reset, so drop this intermediate wq and send directly to the timeout wq.
> (Shaoyun)
>
> v3:
> Lijo suggested putting 'adev->in_gpu_reset' in the amdgpu_reset_domain
> struct. I followed his advice and also moved adev->reset_sem into the
> same place. This in turn required some follow-up refactoring of the
> original patches, where I decoupled the amdgpu_reset_domain life cycle
> from the XGMI hive, because the hive is destroyed and reconstructed when
> the devices in the XGMI hive are reset during probe for SRIOV (see [2]),
> while we need the reset sem and gpu_reset flag to always be present. This
> was attained by adding a refcount to amdgpu_reset_domain so that each
> device can safely point to it for as long as it needs to. (A sketch of
> this refcounting pattern also follows below.)
>
> [1] https://patchwork.kernel.org/project/dri-devel/patch/20210629073510.2764391-3-boris.brezillon@collabora.com/
> [2] https://www.spinics.net/lists/amd-gfx/msg58836.html
>
> P.S. Going through drm-misc-next and not amd-staging-drm-next, as Boris'
> work hasn't landed there yet.
>
> P.P.S. Patches 8-12 are the refactor on top of the original V2 patchset.
>
> P.P.P.S. I wasn't yet able to test the reworked code on an XGMI SRIOV
> system because drm-misc-next fails to load there. Would appreciate it if
> maybe jingwech could try it on his system like he tested V2.
>
> Andrey Grodzovsky (12):
>   drm/amdgpu: Introduce reset domain
>   drm/amdgpu: Move scheduler init to after XGMI is ready
>   drm/amdgpu: Fix crash on modprobe
>   drm/amdgpu: Serialize non TDR gpu recovery with TDRs
>   drm/amd/virt: For SRIOV send GPU reset directly to TDR queue.
>   drm/amdgpu: Drop hive->in_reset
>   drm/amdgpu: Drop concurrent GPU reset protection for device
>   drm/amdgpu: Rework reset domain to be refcounted.
>   drm/amdgpu: Move reset sem into reset_domain
>   drm/amdgpu: Move in_gpu_reset into reset_domain
>   drm/amdgpu: Rework amdgpu_device_lock_adev
>   Revert 'drm/amdgpu: annotate a false positive recursive locking'
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h          |  15 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c  |  10 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c   | 275 ++++++++++--------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c    |  43 +--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c      |   2 +-
>  .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c   |  18 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c    |  39 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h    |  12 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h     |   2 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c     |  24 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.h     |   3 +-
>  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c       |   6 +-
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c        |  14 +-
>  drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c        |  19 +-
>  drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c        |  19 +-
>  drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c        |  11 +-
>  16 files changed, 313 insertions(+), 199 deletions(-)
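To make the serialization scheme in the cover letter concrete, here is a minimal, self-contained sketch of the ordered-workqueue idea, in the spirit of Boris' patch [1] and this series. All names here (reset_domain_sketch, reset_domain_create, and so on) are hypothetical illustrations, not the actual amdgpu or drm scheduler symbols:

```c
#include <linux/slab.h>
#include <linux/workqueue.h>

/* One ordered workqueue per reset domain: every reset source (TDR,
 * sysfs, RAS) queues onto it, so at most one reset runs at a time and
 * resets execute in queueing order, with no explicit locking needed. */
struct reset_domain_sketch {
	struct workqueue_struct *wq;
};

static struct reset_domain_sketch *reset_domain_create(void)
{
	struct reset_domain_sketch *domain;

	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
	if (!domain)
		return NULL;

	/* An ordered wq executes at most one work item at any time. */
	domain->wq = alloc_ordered_workqueue("reset-domain-sketch", 0);
	if (!domain->wq) {
		kfree(domain);
		return NULL;
	}
	return domain;
}

/* All reset triggers funnel through here; queue_work() returns false
 * if this work item is already pending, i.e. already queued. */
static bool reset_domain_queue(struct reset_domain_sketch *domain,
			       struct work_struct *work)
{
	return queue_work(domain->wq, work);
}
```

The design point is that mutual exclusion falls out of the queue's ordering guarantee rather than out of a lock, which is presumably what allows the series to drop the explicit concurrent-reset protection in the "Drop hive->in_reset" and "Drop concurrent GPU reset protection for device" patches.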
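The "first reset cancels the pending ones" behaviour for TDR timers could look roughly like the following. This is again a sketch under assumptions, since the real logic lives in the drm scheduler and the amdgpu recovery path:

```c
#include <linux/list.h>
#include <linux/workqueue.h>

/* Hypothetical per-scheduler state; the drm scheduler similarly keeps
 * its timeout (TDR) handling in a delayed work item. */
struct sched_sketch {
	struct delayed_work tdr_work;
	struct list_head node;		/* link in the domain's list */
};

struct sched_domain_sketch {
	struct list_head scheds;	/* all schedulers in the domain */
};

/*
 * Several TDR timers may fire for one underlying hang, but they all
 * land on the same ordered wq, so the first handler to run can cancel
 * its siblings that are still pending and perform a single recovery.
 */
static void domain_tdr_handler(struct sched_domain_sketch *domain,
			       struct sched_sketch *self)
{
	struct sched_sketch *other;

	list_for_each_entry(other, &domain->scheds, node)
		if (other != self)
			cancel_delayed_work(&other->tdr_work);

	/* ... single GPU recovery for the whole domain goes here ... */
}
```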
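Finally, the v3 refcounting that decouples the domain's lifetime from the XGMI hive can be sketched with a plain kref; device_sketch and all function names are again hypothetical, and creation would kref_init() the count to 1:

```c
#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct refcounted_domain_sketch {
	struct kref refcount;
	struct workqueue_struct *wq;
};

struct device_sketch {
	struct refcounted_domain_sketch *reset_domain;
};

/* Release callback, invoked when the last reference is dropped. */
static void domain_free(struct kref *ref)
{
	struct refcounted_domain_sketch *domain =
		container_of(ref, struct refcounted_domain_sketch, refcount);

	destroy_workqueue(domain->wq);
	kfree(domain);
}

/* Each device holds a reference for as long as it points at the
 * domain, so the domain survives even when the XGMI hive that created
 * it is destroyed and reconstructed (the SRIOV probe case in [2]). */
static void device_adopt_domain(struct device_sketch *dev,
				struct refcounted_domain_sketch *domain)
{
	kref_get(&domain->refcount);
	dev->reset_domain = domain;
}

static void device_drop_domain(struct device_sketch *dev)
{
	kref_put(&dev->reset_domain->refcount, domain_free);
	dev->reset_domain = NULL;
}
```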
Just another ping. With Shyun's help I was able to do some smoke testing on an XGMI SRIOV system (booting and triggering a hive reset), and so far it looks good.

Andrey

On 2022-01-28 14:36, Andrey Grodzovsky wrote:
> Just a gentle ping: do people have more comments on this patch set?
> Especially on the last 5 patches, as the first 7 are exactly the same
> as V2 and we already went over them mostly.
>
> Andrey
>
> [snip: cover letter quoted in full above]
Hi Andrey,

I have been testing your patch and it seems fine so far.

Best Regards,

Jingwen Chen

On 2022/2/3 2:57 AM, Andrey Grodzovsky wrote:
> Just another ping. With Shyun's help I was able to do some smoke testing
> on an XGMI SRIOV system (booting and triggering a hive reset), and so far
> it looks good.
>
> Andrey
>
> [snip]
Thanks a lot!

Andrey

On 2022-02-09 01:06, JingWen Chen wrote:
> Hi Andrey,
>
> I have been testing your patch and it seems fine so far.
>
> Best Regards,
>
> Jingwen Chen
>
> [snip]