Message ID: 20230424014324.218531-1-andrealmeid@igalia.com (mailing list archive)
State: New, archived
Series: drm/amdgpu: Mark contexts guilty for any reset type
Hi André,
kernel test robot noticed the following build warnings:
[auto build test WARNING on drm-misc/drm-misc-next]
[also build test WARNING on drm/drm-next drm-exynos/exynos-drm-next drm-intel/for-linux-next drm-intel/for-linux-next-fixes drm-tip/drm-tip linus/master v6.3 next-20230421]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patches, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Andr-Almeida/drm-amdgpu-Mark-contexts-guilty-for-any-reset-type/20230424-094534
base: git://anongit.freedesktop.org/drm/drm-misc drm-misc-next
patch link: https://lore.kernel.org/r/20230424014324.218531-1-andrealmeid%40igalia.com
patch subject: [PATCH] drm/amdgpu: Mark contexts guilty for any reset type
config: s390-allyesconfig (https://download.01.org/0day-ci/archive/20230424/202304241259.Qq0Dmlud-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/ea7b1d78b677fdcf5f4776e63de611a2681cd5fb
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Andr-Almeida/drm-amdgpu-Mark-contexts-guilty-for-any-reset-type/20230424-094534
git checkout ea7b1d78b677fdcf5f4776e63de611a2681cd5fb
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash drivers/gpu/drm/amd/amdgpu/
If you fix the issue, kindly add the following tags where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304241259.Qq0Dmlud-lkp@intel.com/
All warnings (new ones prefixed by >>):
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c: In function 'amdgpu_device_pre_asic_reset':
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:4738:28: warning: variable 'job' set but not used [-Wunused-but-set-variable]
4738 | struct amdgpu_job *job = NULL;
| ^~~
vim +/job +4738 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
5740682e66cef5 Monk Liu 2017-10-25 4733
e3c1b0712fdb03 shaoyunl 2021-02-16 4734 int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
04442bf70debb1 Lijo Lazar 2021-03-16 4735 struct amdgpu_reset_context *reset_context)
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4736 {
5c1e6fa49e8d8d Huang Rui 2021-12-16 4737 int i, r = 0;
04442bf70debb1 Lijo Lazar 2021-03-16 @4738 struct amdgpu_job *job = NULL;
04442bf70debb1 Lijo Lazar 2021-03-16 4739 bool need_full_reset =
04442bf70debb1 Lijo Lazar 2021-03-16 4740 test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
04442bf70debb1 Lijo Lazar 2021-03-16 4741
04442bf70debb1 Lijo Lazar 2021-03-16 4742 if (reset_context->reset_req_dev == adev)
04442bf70debb1 Lijo Lazar 2021-03-16 4743 job = reset_context->job;
711826656bebb0 Monk Liu 2017-12-25 4744
b602ca5f31fe69 Tiecheng Zhou 2020-08-19 4745 if (amdgpu_sriov_vf(adev)) {
b602ca5f31fe69 Tiecheng Zhou 2020-08-19 4746 /* stop the data exchange thread */
b602ca5f31fe69 Tiecheng Zhou 2020-08-19 4747 amdgpu_virt_fini_data_exchange(adev);
b602ca5f31fe69 Tiecheng Zhou 2020-08-19 4748 }
b602ca5f31fe69 Tiecheng Zhou 2020-08-19 4749
9e225fb9e636b3 Andrey Grodzovsky 2022-06-18 4750 amdgpu_fence_driver_isr_toggle(adev, true);
9e225fb9e636b3 Andrey Grodzovsky 2022-06-18 4751
711826656bebb0 Monk Liu 2017-12-25 4752 /* block all schedulers and reset given job's ring */
0875dc9e80eb3b Chunming Zhou 2016-06-12 4753 for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
0875dc9e80eb3b Chunming Zhou 2016-06-12 4754 struct amdgpu_ring *ring = adev->rings[i];
0875dc9e80eb3b Chunming Zhou 2016-06-12 4755
51687759be93fb Chunming Zhou 2017-04-24 4756 if (!ring || !ring->sched.thread)
0875dc9e80eb3b Chunming Zhou 2016-06-12 4757 continue;
5740682e66cef5 Monk Liu 2017-10-25 4758
c530b02f39850a Jack Zhang 2021-05-12 4759 /*clear job fence from fence drv to avoid force_completion
c530b02f39850a Jack Zhang 2021-05-12 4760 *leave NULL and vm flush fence in fence drv */
5c1e6fa49e8d8d Huang Rui 2021-12-16 4761 amdgpu_fence_driver_clear_job_fences(ring);
c530b02f39850a Jack Zhang 2021-05-12 4762
2200edac745a65 Chunming Zhou 2016-06-30 4763 /* after all hw jobs are reset, hw fence is meaningless, so force_completion */
2f9d4084cac96a Monk Liu 2017-10-16 4764 amdgpu_fence_driver_force_completion(ring);
2f9d4084cac96a Monk Liu 2017-10-16 4765 }
d38ceaf99ed015 Alex Deucher 2015-04-20 4766
9e225fb9e636b3 Andrey Grodzovsky 2022-06-18 4767 amdgpu_fence_driver_isr_toggle(adev, false);
9e225fb9e636b3 Andrey Grodzovsky 2022-06-18 4768
04442bf70debb1 Lijo Lazar 2021-03-16 4769 r = amdgpu_reset_prepare_hwcontext(adev, reset_context);
404b277bbe4945 Lijo Lazar 2021-03-26 4770 /* If reset handler not implemented, continue; otherwise return */
404b277bbe4945 Lijo Lazar 2021-03-26 4771 if (r == -ENOSYS)
404b277bbe4945 Lijo Lazar 2021-03-26 4772 r = 0;
404b277bbe4945 Lijo Lazar 2021-03-26 4773 else
04442bf70debb1 Lijo Lazar 2021-03-16 4774 return r;
04442bf70debb1 Lijo Lazar 2021-03-16 4775
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4776 /* Don't suspend on bare metal if we are not going to HW reset the ASIC */
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4777 if (!amdgpu_sriov_vf(adev)) {
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4778
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4779 if (!need_full_reset)
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4780 need_full_reset = amdgpu_device_ip_need_full_reset(adev);
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4781
360cd08196cabc Likun Gao 2022-12-21 4782 if (!need_full_reset && amdgpu_gpu_recovery &&
360cd08196cabc Likun Gao 2022-12-21 4783 amdgpu_device_ip_check_soft_reset(adev)) {
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4784 amdgpu_device_ip_pre_soft_reset(adev);
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4785 r = amdgpu_device_ip_soft_reset(adev);
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4786 amdgpu_device_ip_post_soft_reset(adev);
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4787 if (r || amdgpu_device_ip_check_soft_reset(adev)) {
aac891685da661 Dennis Li 2020-08-20 4788 dev_info(adev->dev, "soft reset failed, will fallback to full reset!\n");
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4789 need_full_reset = true;
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4790 }
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4791 }
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4792
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4793 if (need_full_reset)
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4794 r = amdgpu_device_ip_suspend(adev);
04442bf70debb1 Lijo Lazar 2021-03-16 4795 if (need_full_reset)
04442bf70debb1 Lijo Lazar 2021-03-16 4796 set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
04442bf70debb1 Lijo Lazar 2021-03-16 4797 else
04442bf70debb1 Lijo Lazar 2021-03-16 4798 clear_bit(AMDGPU_NEED_FULL_RESET,
04442bf70debb1 Lijo Lazar 2021-03-16 4799 &reset_context->flags);
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4800 }
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4801
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4802 return r;
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4803 }
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4804
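For reference, the robot's warning arises because the patch removes the only user of `job` in amdgpu_device_pre_asic_reset(). One possible follow-up (an untested sketch on top of the patch under review, not an official fix) would drop the now-dead lookup as well:

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4735,11 +4735,7 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 			       struct amdgpu_reset_context *reset_context)
 {
 	int i, r = 0;
-	struct amdgpu_job *job = NULL;
 	bool need_full_reset =
 		test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
 
-	if (reset_context->reset_req_dev == adev)
-		job = reset_context->job;
-
 	if (amdgpu_sriov_vf(adev)) {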
On 24.04.23 at 03:43, André Almeida wrote:
> When a DRM job timeout, the GPU is probably hang and amdgpu have some
> ways to deal with that, ranging from soft recoveries to full device
> reset. Anyway, when userspace ask the kernel the state of the context
> (via AMDGPU_CTX_OP_QUERY_STATE), the kernel reports that the device was
> reset, regardless if a full reset happened or not.
>
> However, amdgpu only marks a context guilty in the ASIC reset path. This
> makes the userspace report incomplete, given that on soft recovery path
> the guilty context is not told that it's the guilty one.
>
> Fix this by marking the context guilty for every type of reset when a
> job timeouts.

The guilty handling is pretty much broken by design and only works because we go through multiple hops of validating the entity after the job has already been pushed to the hw.

I think we should probably just remove that completely and use an approach where we check the in-flight submissions in the query state IOCTL. See my other patch on the mailing list regarding that.

In addition to that, I currently don't consider soft-recovered submissions as fatal and continue accepting submissions from that context, but I already wanted to talk with Marek about that behavior.

Regards,
Christian.
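For readers unfamiliar with the mechanism Christian refers to: drm_sched_increase_karma() bumps a counter on the entity that owns the timed-out job, and once the counter exceeds the scheduler's hang limit the entity's context is treated as guilty and further submissions from it are rejected. A self-contained toy sketch of that idea follows; all names and the limit value are illustrative, not the actual DRM scheduler code:

```c
#include <stdbool.h>

/* Toy model of karma-based guilt tracking: every job timeout bumps the
 * owning entity's karma; once it exceeds the hang limit the context is
 * marked guilty, so a broken app cannot hang the GPU over and over. */

#define TOY_HANG_LIMIT 1 /* illustrative value, not amdgpu's actual limit */

struct toy_entity {
	int karma;
	bool guilty;
};

/* Called from the job-timeout handler; the patch under discussion makes
 * this run for soft recoveries as well as full resets. */
static void toy_increase_karma(struct toy_entity *e)
{
	if (++e->karma > TOY_HANG_LIMIT)
		e->guilty = true;
}

/* Submission path: refuse work from guilty contexts. */
static int toy_submit(const struct toy_entity *e)
{
	return e->guilty ? -1 /* would be an errno like -ECANCELED */ : 0;
}
```

Christian's point is that in the real scheduler this guilt check happens several hops after the job was already pushed to the hardware, which is why he considers the design broken.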
Hi Christian,

thank you for your comments.

On 24/04/2023 04:03, Christian König wrote:
> The guilty handling is pretty much broken by design and only works
> because we go through multiple hops of validating the entity after the
> job has already been pushed to the hw.

I see, thanks.

> I think we should probably just remove that completely and use an
> approach where we check the in flight submissions in the query state
> IOCTL.

Like the DRM_IOCTL_I915_GET_RESET_STATS approach?

> See my other patch on the mailing list regarding that.

Which one, the "[PATCH 1/8] drm/scheduler: properly forward fence errors" series?

> Additional to that I currently didn't considered soft-recovered
> submissions as fatal and continue accepting submissions from that
> context, but already wanted to talk with Marek about that behavior.

Interesting. I will try to test and validate this approach to see if the contexts keep working as expected on soft-resets.

> Regards,
> Christian.
On 4/24/23 15:26, André Almeida wrote:
>> Additional to that I currently didn't considered soft-recovered
>> submissions as fatal and continue accepting submissions from that
>> context, but already wanted to talk with Marek about that behavior.
>
> Interesting. I will try to test and validate this approach to see if
> the contexts keep working as expected on soft-resets.

FWIW, on this Thinkpad E595 with a Picasso APU, I've hit soft-resets (usually either in Firefox or gnome-shell) a number of times, and usually continued using the GNOME session for a few days without any issues.

(Interestingly, Firefox reacts to the soft-resets by falling back to software rendering, even when it's not guilty itself.)
Soft resets are just as fatal as hard resets, but no reset is "always fatal". There are cases when apps keep working depending on which features are being used. It's still unsafe.

Marek

On Mon, Apr 24, 2023, 03:03 Christian König <christian.koenig@amd.com> wrote:
> The guilty handling is pretty much broken by design and only works
> because we go through multiple hops of validating the entity after the
> job has already been pushed to the hw.
>
> I think we should probably just remove that completely and use an
> approach where we check the in flight submissions in the query state
> IOCTL. See my other patch on the mailing list regarding that.
>
> Additional to that I currently didn't considered soft-recovered
> submissions as fatal and continue accepting submissions from that
> context, but already wanted to talk with Marek about that behavior.
>
> Regards,
> Christian.
On 4/24/23 18:45, Marek Olšák wrote:
> Soft resets are fatal just as hard resets, but no reset is "always fatal". There are cases when apps keep working depending on which features are being used. It's still unsafe.
Agreed, in theory.
In practice, from a user PoV, right now there's pretty much 0 chance of the user session surviving if the GPU context in certain critical processes (e.g. the Wayland compositor or Xwayland) hits a fatal reset. There's a > 0 chance of it surviving after a soft reset. There's ongoing work towards making user-space components more robust against fatal resets, but it's taking time. Meanwhile, I suspect most users would take the > 0 chance.
That supposedly depends on the compositor. There may be compositors for very specific cases (e.g. Steam Deck) that handle resets very well, and those would like to be properly notified of all resets because that's how they get the best outcome, e.g. no corruption. A soft reset that is unhandled by userspace may result in persistent corruption.

Marek

On Tue, Apr 25, 2023 at 6:27 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote:
> Agreed, in theory.
>
> In practice, from a user PoV, right now there's pretty much 0 chance of
> the user session surviving if the GPU context in certain critical
> processes (e.g. the Wayland compositor or Xwayland) hits a fatal reset.
> There's a > 0 chance of it surviving after a soft reset. There's ongoing
> work towards making user-space components more robust against fatal
> resets, but it's taking time. Meanwhile, I suspect most users would take
> the > 0 chance.
>
> --
> Earthling Michel Dänzer | https://redhat.com
> Libre software enthusiast | Mesa and Xwayland developer
Well, signaling that something happened is not the question. We do this for both soft as well as hard resets.

The question is whether errors result in blocking further submissions with the same context or not.

In case of a hard reset and potential loss of state, we have to kill the context, otherwise a follow-up submission would just lock up the hardware once more.

In case of a soft reset I think we can keep the context alive; this way even applications without robustness handling can keep working. You potentially still get some corruption, but at least your compositor isn't killed.

Christian.

On 25.04.23 at 13:07, Marek Olšák wrote:
> That supposedly depends on the compositor. There may be compositors
> for very specific cases (e.g. Steam Deck) that handle resets very
> well, and those would like to be properly notified of all resets
> because that's how they get the best outcome, e.g. no corruption. A
> soft reset that is unhandled by userspace may result in persistent
> corruption.
>
> Marek
On 4/25/23 14:08, Christian König wrote:
> Well signaling that something happened is not the question. We do this
> for both soft as well as hard resets.
>
> The question is if errors result in blocking further submissions with
> the same context or not.
>
> In case of a hard reset and potential loss of state we have to kill the
> context, otherwise a follow up submission would just lockup the hardware
> once more.
>
> In case of a soft reset I think we can keep the context alive, this way
> even applications without robustness handling can keep work.
>
> You potentially still get some corruption, but at least not your
> compositor killed.

Right, and if there is corruption, the user can restart the session.

Maybe a possible compromise could be making soft resets fatal if user space enabled robustness for the context, and non-fatal if not.
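Michel's suggested compromise can be phrased as a tiny decision function. This is purely illustrative: the names and the policy itself are a sketch of the proposal under discussion, not existing kernel behavior:

```c
#include <stdbool.h>

enum toy_reset_type { TOY_RESET_SOFT, TOY_RESET_HARD };

/* Sketch of the compromise above: a hard reset always kills the context
 * (lost hardware state would just hang again on the next submission);
 * a soft reset kills it only when userspace opted into robustness, since
 * a robust app will see the notification and recreate its context, while
 * a non-robust app is allowed to limp along at the risk of corruption. */
static bool toy_context_killed(enum toy_reset_type type, bool robust_ctx)
{
	if (type == TOY_RESET_HARD)
		return true;
	return robust_ctx;
}
```

Marek's objection below is precisely that this table is wrong in his view: for some workloads the soft-reset/non-robust cell is just as destructive as a hard reset.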
On 25.04.23 at 14:14, Michel Dänzer wrote:
> Right, and if there is corruption, the user can restart the session.
>
> Maybe a possible compromise could be making soft resets fatal if user
> space enabled robustness for the context, and non-fatal if not.

Well, that should already be mostly the case. If an application has enabled robustness, it should notice that something went wrong and act appropriately.

The only thing we need to handle is applications without robustness in case of a hard reset, as otherwise they will trigger a reset over and over again.

Christian.
The last 3 comments in this thread contain arguments that are false and were specifically pointed out as false 6 comments ago: soft resets are just as fatal as hard resets. There is nothing better about soft resets. If the VRAM is lost completely, that's a different story, and if the hard reset is 100% unreliable, that's also a different story, but other than those two outliers, there is no difference between the two from the user point of view. Both can repeatedly hang if you don't prevent the app that caused the hang from using the GPU, even if the app is not robust. The robustness context type doesn't matter here. By definition, no guilty app can continue after a reset, and no innocent apps affected by a reset can continue either, because those can now hang too. That's how destructive all resets are. Personal anecdotes that the soft reset is better are just that, anecdotes.

Marek

On Tue, Apr 25, 2023, 08:44 Christian König <christian.koenig@amd.com> wrote:
> Well that should already be mostly the case. If an application has
> enabled robustness it should notice that something went wrong and act
> appropriately.
>
> The only thing we need to handle is for applications without robustness
> in case of a hard reset or otherwise it will trigger an reset over and
> over again.
>
> Christian.
On 4/25/23 21:11, Marek Olšák wrote:
> The last 3 comments in this thread contain arguments that are false and were specifically pointed out as false 6 comments ago: Soft resets are just as fatal as hard resets. There is nothing better about soft resets. If the VRAM is lost completely, that's a different story, and if the hard reset is 100% unreliable, that's also a different story, but other than those two outliers, there is no difference between the two from the user point view. Both can repeatedly hang if you don't prevent the app that caused the hang from using the GPU even if the app is not robust. The robustness context type doesn't matter here. By definition, no guilty app can continue after a reset, and no innocent apps affected by a reset can continue either because those can now hang too. That's how destructive all resets are. Personal anecdotes that the soft reset is better are just that, anecdotes.
You're trying to frame the situation as black or white, but reality is shades of grey.
There's a similar situation with kernel Oopsen: In principle it's not safe to continue executing the kernel after it hits an Oops, since it might be in an inconsistent state, which could result in any kind of misbehaviour. Still, the default behaviour is to continue executing, and in most cases it turns out fine. Users which cannot accept the residual risk can choose to make the kernel panic when it hits an Oops (either via CONFIG_PANIC_ON_OOPS at build time, or via oops=panic on the kernel command line). A kernel panic means that the machine basically freezes from a user PoV, which would be worse as the default behaviour for most users (because it would e.g. incur a higher risk of losing filesystem data).
Perhaps I should clarify this. There are GL and Vulkan features where, if an app uses them and its shaders are killed, the next IB will hang. One of them is Draw Indirect: if a shader is killed before storing the vertex count and instance count in memory, the next draw will hang with a high probability. No such app can be allowed to continue executing after a reset.

Marek

On Wed, Apr 26, 2023 at 5:51 AM Michel Dänzer <michel.daenzer@mailbox.org> wrote:
> You're trying to frame the situation as black or white, but reality is
> shades of grey.
>
> There's a similar situation with kernel Oopsen: In principle it's not
> safe to continue executing the kernel after it hits an Oops, since it
> might be in an inconsistent state, which could result in any kind of
> misbehaviour. Still, the default behaviour is to continue executing, and
> in most cases it turns out fine. Users which cannot accept the residual
> risk can choose to make the kernel panic when it hits an Oops (either
> via CONFIG_PANIC_ON_OOPS at build time, or via oops=panic on the kernel
> command line). A kernel panic means that the machine basically freezes
> from a user PoV, which would be worse as the default behaviour for most
> users (because it would e.g. incur a higher risk of losing filesystem
> data).
>
> --
> Earthling Michel Dänzer | https://redhat.com
> Libre software enthusiast | Mesa and Xwayland developer
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ac78caa7cba8..ea169d1689e2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4771,9 +4771,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 
 	amdgpu_fence_driver_isr_toggle(adev, false);
 
-	if (job && job->vm)
-		drm_sched_increase_karma(&job->base);
-
 	r = amdgpu_reset_prepare_hwcontext(adev, reset_context);
 	/* If reset handler not implemented, continue; otherwise return */
 	if (r == -ENOSYS)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index c3d9d75143f4..097ed8f06865 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -51,6 +51,13 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 	memset(&ti, 0, sizeof(struct amdgpu_task_info));
 	adev->job_hang = true;
 
+	amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
+
+	if (job && job->vm) {
+		DRM_INFO("marking %s context as guilty", ti.process_name);
+		drm_sched_increase_karma(&job->base);
+	}
+
 	if (amdgpu_gpu_recovery &&
 	    amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
 		DRM_ERROR("ring %s timeout, but soft recovered\n",
@@ -58,7 +65,6 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 		goto exit;
 	}
 
-	amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
 	DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
 		  job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
 		  ring->fence_drv.sync_seq);
When a DRM job times out, the GPU is probably hung, and amdgpu has several ways to deal with that, ranging from soft recoveries to a full device reset. In any case, when userspace asks the kernel for the state of the context (via AMDGPU_CTX_OP_QUERY_STATE), the kernel reports that the device was reset, regardless of whether a full reset happened or not.

However, amdgpu only marks a context guilty in the ASIC reset path. This makes the userspace report incomplete, given that on the soft recovery path the guilty context is not told that it is the guilty one.

Fix this by marking the context guilty for every type of reset when a job times out.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 8 +++++++-
 2 files changed, 7 insertions(+), 4 deletions(-)