
drm/amdgpu: Mark contexts guilty for any reset type

Message ID 20230424014324.218531-1-andrealmeid@igalia.com (mailing list archive)
State New, archived
Series drm/amdgpu: Mark contexts guilty for any reset type

Commit Message

André Almeida April 24, 2023, 1:43 a.m. UTC
When a DRM job times out, the GPU is probably hung and amdgpu has
several ways to deal with that, ranging from soft recovery to a full
device reset. Either way, when userspace asks the kernel for the state
of the context (via AMDGPU_CTX_OP_QUERY_STATE), the kernel reports that
the device was reset, regardless of whether a full reset happened or
not.

However, amdgpu only marks a context guilty in the ASIC reset path. This
makes the report to userspace incomplete, given that on the soft
recovery path the guilty context is never told that it is the guilty
one.

Fix this by marking the context guilty for every type of reset when a
job times out.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 8 +++++++-
 2 files changed, 7 insertions(+), 4 deletions(-)
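
For reference, this is the query path the commit message refers to, as seen
from userspace. A minimal sketch (illustrative only, not part of the patch;
it assumes the libdrm_amdgpu wrapper amdgpu_cs_query_reset_state() and a
context handle created earlier with amdgpu_cs_ctx_create()):

#include <stdio.h>
#include <amdgpu.h>
#include <amdgpu_drm.h>

/* Ask the kernel whether this context was involved in a reset and, if so,
 * whether it was the guilty one. This wraps the AMDGPU_CTX_OP_QUERY_STATE
 * ioctl discussed in the commit message. */
static void report_ctx_state(amdgpu_context_handle ctx)
{
	uint32_t state, hangs;

	if (amdgpu_cs_query_reset_state(ctx, &state, &hangs))
		return;

	switch (state) {
	case AMDGPU_CTX_GUILTY_RESET:
		printf("reset happened, this context was guilty (%u hangs)\n", hangs);
		break;
	case AMDGPU_CTX_INNOCENT_RESET:
		printf("reset happened, this context was innocent\n");
		break;
	case AMDGPU_CTX_UNKNOWN_RESET:
		printf("reset happened, cause unknown\n");
		break;
	default: /* AMDGPU_CTX_NO_RESET */
		printf("no reset observed\n");
	}
}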

Comments

kernel test robot April 24, 2023, 5:11 a.m. UTC | #1
Hi André,

kernel test robot noticed the following build warnings:

[auto build test WARNING on drm-misc/drm-misc-next]
[also build test WARNING on drm/drm-next drm-exynos/exynos-drm-next drm-intel/for-linux-next drm-intel/for-linux-next-fixes drm-tip/drm-tip linus/master v6.3 next-20230421]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Andr-Almeida/drm-amdgpu-Mark-contexts-guilty-for-any-reset-type/20230424-094534
base:   git://anongit.freedesktop.org/drm/drm-misc drm-misc-next
patch link:    https://lore.kernel.org/r/20230424014324.218531-1-andrealmeid%40igalia.com
patch subject: [PATCH] drm/amdgpu: Mark contexts guilty for any reset type
config: s390-allyesconfig (https://download.01.org/0day-ci/archive/20230424/202304241259.Qq0Dmlud-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/ea7b1d78b677fdcf5f4776e63de611a2681cd5fb
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Andr-Almeida/drm-amdgpu-Mark-contexts-guilty-for-any-reset-type/20230424-094534
        git checkout ea7b1d78b677fdcf5f4776e63de611a2681cd5fb
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash drivers/gpu/drm/amd/amdgpu/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304241259.Qq0Dmlud-lkp@intel.com/

All warnings (new ones prefixed by >>):

   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c: In function 'amdgpu_device_pre_asic_reset':
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:4738:28: warning: variable 'job' set but not used [-Wunused-but-set-variable]
    4738 |         struct amdgpu_job *job = NULL;
         |                            ^~~


vim +/job +4738 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

5740682e66cef5 Monk Liu          2017-10-25  4733  
e3c1b0712fdb03 shaoyunl          2021-02-16  4734  int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
04442bf70debb1 Lijo Lazar        2021-03-16  4735  				 struct amdgpu_reset_context *reset_context)
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4736  {
5c1e6fa49e8d8d Huang Rui         2021-12-16  4737  	int i, r = 0;
04442bf70debb1 Lijo Lazar        2021-03-16 @4738  	struct amdgpu_job *job = NULL;
04442bf70debb1 Lijo Lazar        2021-03-16  4739  	bool need_full_reset =
04442bf70debb1 Lijo Lazar        2021-03-16  4740  		test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
04442bf70debb1 Lijo Lazar        2021-03-16  4741  
04442bf70debb1 Lijo Lazar        2021-03-16  4742  	if (reset_context->reset_req_dev == adev)
04442bf70debb1 Lijo Lazar        2021-03-16  4743  		job = reset_context->job;
711826656bebb0 Monk Liu          2017-12-25  4744  
b602ca5f31fe69 Tiecheng Zhou     2020-08-19  4745  	if (amdgpu_sriov_vf(adev)) {
b602ca5f31fe69 Tiecheng Zhou     2020-08-19  4746  		/* stop the data exchange thread */
b602ca5f31fe69 Tiecheng Zhou     2020-08-19  4747  		amdgpu_virt_fini_data_exchange(adev);
b602ca5f31fe69 Tiecheng Zhou     2020-08-19  4748  	}
b602ca5f31fe69 Tiecheng Zhou     2020-08-19  4749  
9e225fb9e636b3 Andrey Grodzovsky 2022-06-18  4750  	amdgpu_fence_driver_isr_toggle(adev, true);
9e225fb9e636b3 Andrey Grodzovsky 2022-06-18  4751  
711826656bebb0 Monk Liu          2017-12-25  4752  	/* block all schedulers and reset given job's ring */
0875dc9e80eb3b Chunming Zhou     2016-06-12  4753  	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
0875dc9e80eb3b Chunming Zhou     2016-06-12  4754  		struct amdgpu_ring *ring = adev->rings[i];
0875dc9e80eb3b Chunming Zhou     2016-06-12  4755  
51687759be93fb Chunming Zhou     2017-04-24  4756  		if (!ring || !ring->sched.thread)
0875dc9e80eb3b Chunming Zhou     2016-06-12  4757  			continue;
5740682e66cef5 Monk Liu          2017-10-25  4758  
c530b02f39850a Jack Zhang        2021-05-12  4759  		/*clear job fence from fence drv to avoid force_completion
c530b02f39850a Jack Zhang        2021-05-12  4760  		 *leave NULL and vm flush fence in fence drv */
5c1e6fa49e8d8d Huang Rui         2021-12-16  4761  		amdgpu_fence_driver_clear_job_fences(ring);
c530b02f39850a Jack Zhang        2021-05-12  4762  
2200edac745a65 Chunming Zhou     2016-06-30  4763  		/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
2f9d4084cac96a Monk Liu          2017-10-16  4764  		amdgpu_fence_driver_force_completion(ring);
2f9d4084cac96a Monk Liu          2017-10-16  4765  	}
d38ceaf99ed015 Alex Deucher      2015-04-20  4766  
9e225fb9e636b3 Andrey Grodzovsky 2022-06-18  4767  	amdgpu_fence_driver_isr_toggle(adev, false);
9e225fb9e636b3 Andrey Grodzovsky 2022-06-18  4768  
04442bf70debb1 Lijo Lazar        2021-03-16  4769  	r = amdgpu_reset_prepare_hwcontext(adev, reset_context);
404b277bbe4945 Lijo Lazar        2021-03-26  4770  	/* If reset handler not implemented, continue; otherwise return */
404b277bbe4945 Lijo Lazar        2021-03-26  4771  	if (r == -ENOSYS)
404b277bbe4945 Lijo Lazar        2021-03-26  4772  		r = 0;
404b277bbe4945 Lijo Lazar        2021-03-26  4773  	else
04442bf70debb1 Lijo Lazar        2021-03-16  4774  		return r;
04442bf70debb1 Lijo Lazar        2021-03-16  4775  
1d721ed679db18 Andrey Grodzovsky 2019-04-18  4776  	/* Don't suspend on bare metal if we are not going to HW reset the ASIC */
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4777  	if (!amdgpu_sriov_vf(adev)) {
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4778  
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4779  		if (!need_full_reset)
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4780  			need_full_reset = amdgpu_device_ip_need_full_reset(adev);
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4781  
360cd08196cabc Likun Gao         2022-12-21  4782  		if (!need_full_reset && amdgpu_gpu_recovery &&
360cd08196cabc Likun Gao         2022-12-21  4783  		    amdgpu_device_ip_check_soft_reset(adev)) {
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4784  			amdgpu_device_ip_pre_soft_reset(adev);
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4785  			r = amdgpu_device_ip_soft_reset(adev);
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4786  			amdgpu_device_ip_post_soft_reset(adev);
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4787  			if (r || amdgpu_device_ip_check_soft_reset(adev)) {
aac891685da661 Dennis Li         2020-08-20  4788  				dev_info(adev->dev, "soft reset failed, will fallback to full reset!\n");
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4789  				need_full_reset = true;
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4790  			}
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4791  		}
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4792  
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4793  		if (need_full_reset)
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4794  			r = amdgpu_device_ip_suspend(adev);
04442bf70debb1 Lijo Lazar        2021-03-16  4795  		if (need_full_reset)
04442bf70debb1 Lijo Lazar        2021-03-16  4796  			set_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
04442bf70debb1 Lijo Lazar        2021-03-16  4797  		else
04442bf70debb1 Lijo Lazar        2021-03-16  4798  			clear_bit(AMDGPU_NEED_FULL_RESET,
04442bf70debb1 Lijo Lazar        2021-03-16  4799  				  &reset_context->flags);
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4800  	}
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4801  
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4802  	return r;
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4803  }
26bc534094ed45 Andrey Grodzovsky 2018-11-22  4804
Christian König April 24, 2023, 7:03 a.m. UTC | #2
On 24.04.23 at 03:43, André Almeida wrote:
> When a DRM job timeout, the GPU is probably hang and amdgpu have some
> ways to deal with that, ranging from soft recoveries to full device
> reset. Anyway, when userspace ask the kernel the state of the context
> (via AMDGPU_CTX_OP_QUERY_STATE), the kernel reports that the device was
> reset, regardless if a full reset happened or not.
>
> However, amdgpu only marks a context guilty in the ASIC reset path. This
> makes the userspace report incomplete, given that on soft recovery path
> the guilty context is not told that it's the guilty one.
>
> Fix this by marking the context guilty for every type of reset when a
> job timeouts.

The guilty handling is pretty much broken by design and only works 
because we go through multiple hops of validating the entity after the 
job has already been pushed to the hw.

I think we should probably just remove that completely and use an 
approach where we check the in flight submissions in the query state 
IOCTL. See my other patch on the mailing list regarding that.

In addition to that, I currently don't consider soft-recovered 
submissions as fatal and continue accepting submissions from that 
context, but I already wanted to talk with Marek about that behavior.

Regards,
Christian.

>
> Signed-off-by: André Almeida <andrealmeid@igalia.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 8 +++++++-
>   2 files changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index ac78caa7cba8..ea169d1689e2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4771,9 +4771,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>   
>   	amdgpu_fence_driver_isr_toggle(adev, false);
>   
> -	if (job && job->vm)
> -		drm_sched_increase_karma(&job->base);
> -
>   	r = amdgpu_reset_prepare_hwcontext(adev, reset_context);
>   	/* If reset handler not implemented, continue; otherwise return */
>   	if (r == -ENOSYS)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index c3d9d75143f4..097ed8f06865 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -51,6 +51,13 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>   	memset(&ti, 0, sizeof(struct amdgpu_task_info));
>   	adev->job_hang = true;
>   
> +	amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
> +
> +	if (job && job->vm) {
> +		DRM_INFO("marking %s context as guilty", ti.process_name);
> +		drm_sched_increase_karma(&job->base);
> +	}
> +
>   	if (amdgpu_gpu_recovery &&
>   	    amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
>   		DRM_ERROR("ring %s timeout, but soft recovered\n",
> @@ -58,7 +65,6 @@ static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
>   		goto exit;
>   	}
>   
> -	amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
>   	DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
>   		  job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
>   		  ring->fence_drv.sync_seq);
André Almeida April 24, 2023, 1:26 p.m. UTC | #3
Hi Christian, thank you for your comments.

On 24/04/2023 04:03, Christian König wrote:
> Am 24.04.23 um 03:43 schrieb André Almeida:
>> When a DRM job timeout, the GPU is probably hang and amdgpu have some
>> ways to deal with that, ranging from soft recoveries to full device
>> reset. Anyway, when userspace ask the kernel the state of the context
>> (via AMDGPU_CTX_OP_QUERY_STATE), the kernel reports that the device was
>> reset, regardless if a full reset happened or not.
>>
>> However, amdgpu only marks a context guilty in the ASIC reset path. This
>> makes the userspace report incomplete, given that on soft recovery path
>> the guilty context is not told that it's the guilty one.
>>
>> Fix this by marking the context guilty for every type of reset when a
>> job timeouts.
> 
> The guilty handling is pretty much broken by design and only works 
> because we go through multiple hops of validating the entity after the 
> job has already been pushed to the hw.

I see, thanks.

> 
> I think we should probably just remove that completely and use an 
> approach where we check the in flight submissions in the query state 
> IOCTL.

Like the DRM_IOCTL_I915_GET_RESET_STATS approach?

> See my other patch on the mailing list regarding that.

Which one, the "[PATCH 1/8] drm/scheduler: properly forward fence 
errors" series?

> 
> Additional to that I currently didn't considered soft-recovered 
> submissions as fatal and continue accepting submissions from that 
> context, but already wanted to talk with Marek about that behavior.
> 

Interesting. I will try to test and validate this approach to see if the 
contexts keep working as expected on soft-resets.

> Regards,
> Christian.
Michel Dänzer April 24, 2023, 3:27 p.m. UTC | #4
On 4/24/23 15:26, André Almeida wrote:
>>
>> Additional to that I currently didn't considered soft-recovered submissions as fatal and continue accepting submissions from that context, but already wanted to talk with Marek about that behavior.
>>
> 
> Interesting. I will try to test and validate this approach to see if the contexts keep working as expected on soft-resets.

FWIW, on this Thinkpad E595 with a Picasso APU, I've hit soft-resets (usually either in Firefox or gnome-shell) a number of times, and usually continued using the GNOME session for a few days without any issues.

(Interestingly, Firefox reacts to the soft-resets by falling back to software rendering, even when it's not guilty itself)
Marek Olšák April 24, 2023, 4:45 p.m. UTC | #5
Soft resets are just as fatal as hard resets, but no reset is "always
fatal". There are cases where apps keep working, depending on which
features are being used. It's still unsafe.

Marek

On Mon, Apr 24, 2023, 03:03 Christian König <christian.koenig@amd.com>
wrote:

> Am 24.04.23 um 03:43 schrieb André Almeida:
> > When a DRM job timeout, the GPU is probably hang and amdgpu have some
> > ways to deal with that, ranging from soft recoveries to full device
> > reset. Anyway, when userspace ask the kernel the state of the context
> > (via AMDGPU_CTX_OP_QUERY_STATE), the kernel reports that the device was
> > reset, regardless if a full reset happened or not.
> >
> > However, amdgpu only marks a context guilty in the ASIC reset path. This
> > makes the userspace report incomplete, given that on soft recovery path
> > the guilty context is not told that it's the guilty one.
> >
> > Fix this by marking the context guilty for every type of reset when a
> > job timeouts.
>
> The guilty handling is pretty much broken by design and only works
> because we go through multiple hops of validating the entity after the
> job has already been pushed to the hw.
>
> I think we should probably just remove that completely and use an
> approach where we check the in flight submissions in the query state
> IOCTL. See my other patch on the mailing list regarding that.
>
> Additional to that I currently didn't considered soft-recovered
> submissions as fatal and continue accepting submissions from that
> context, but already wanted to talk with Marek about that behavior.
>
> Regards,
> Christian.
>
Michel Dänzer April 25, 2023, 10:27 a.m. UTC | #6
On 4/24/23 18:45, Marek Olšák wrote:
> Soft resets are fatal just as hard resets, but no reset is "always fatal". There are cases when apps keep working depending on which features are being used. It's still unsafe.

Agreed, in theory.

In practice, from a user PoV, right now there's pretty much 0 chance of the user session surviving if the GPU context in certain critical processes (e.g. the Wayland compositor or Xwayland) hits a fatal reset. There's a > 0 chance of it surviving after a soft reset. There's ongoing work towards making user-space components more robust against fatal resets, but it's taking time. Meanwhile, I suspect most users would take the > 0 chance.
Marek Olšák April 25, 2023, 11:07 a.m. UTC | #7
That supposedly depends on the compositor. There may be compositors for
very specific cases (e.g. Steam Deck) that handle resets very well, and
those would like to be properly notified of all resets because that's how
they get the best outcome, e.g. no corruption. A soft reset that is
unhandled by userspace may result in persistent corruption.

Marek

On Tue, Apr 25, 2023 at 6:27 AM Michel Dänzer <michel.daenzer@mailbox.org>
wrote:

> On 4/24/23 18:45, Marek Olšák wrote:
> > Soft resets are fatal just as hard resets, but no reset is "always
> fatal". There are cases when apps keep working depending on which features
> are being used. It's still unsafe.
>
> Agreed, in theory.
>
> In practice, from a user PoV, right now there's pretty much 0 chance of
> the user session surviving if the GPU context in certain critical processes
> (e.g. the Wayland compositor or Xwayland) hits a fatal reset. There's a > 0
> chance of it surviving after a soft reset. There's ongoing work towards
> making user-space components more robust against fatal resets, but it's
> taking time. Meanwhile, I suspect most users would take the > 0 chance.
>
Christian König April 25, 2023, 12:08 p.m. UTC | #8
Well, signaling that something happened is not the question; we do this 
for both soft and hard resets.

The question is whether errors result in blocking further submissions 
from the same context or not.

In the case of a hard reset and potential loss of state we have to kill 
the context, otherwise a follow-up submission would just lock up the 
hardware once more.

In the case of a soft reset I think we can keep the context alive; this 
way even applications without robustness handling can keep working.

You potentially still get some corruption, but at least your compositor 
doesn't get killed.

Christian.

On 25.04.23 at 13:07, Marek Olšák wrote:
> That supposedly depends on the compositor. There may be compositors 
> for very specific cases (e.g. Steam Deck) that handle resets very 
> well, and those would like to be properly notified of all resets 
> because that's how they get the best outcome, e.g. no corruption. A 
> soft reset that is unhandled by userspace may result in persistent 
> corruption.
>
> Marek
>
> On Tue, Apr 25, 2023 at 6:27 AM Michel Dänzer 
> <michel.daenzer@mailbox.org> wrote:
>
>     On 4/24/23 18:45, Marek Olšák wrote:
>     > Soft resets are fatal just as hard resets, but no reset is
>     "always fatal". There are cases when apps keep working depending
>     on which features are being used. It's still unsafe.
>
>     Agreed, in theory.
>
>     In practice, from a user PoV, right now there's pretty much 0
>     chance of the user session surviving if the GPU context in certain
>     critical processes (e.g. the Wayland compositor or Xwayland) hits
>     a fatal reset. There's a > 0 chance of it surviving after a soft
>     reset. There's ongoing work towards making user-space components
>     more robust against fatal resets, but it's taking time. Meanwhile,
>     I suspect most users would take the > 0 chance.
>
Michel Dänzer April 25, 2023, 12:14 p.m. UTC | #9
On 4/25/23 14:08, Christian König wrote:
> Well signaling that something happened is not the question. We do this for both soft as well as hard resets.
> 
> The question is if errors result in blocking further submissions with the same context or not.
> 
> In case of a hard reset and potential loss of state we have to kill the context, otherwise a follow up submission would just lockup the hardware once more.
> 
> In case of a soft reset I think we can keep the context alive, this way even applications without robustness handling can keep work.
> 
> You potentially still get some corruption, but at least not your compositor killed.

Right, and if there is corruption, the user can restart the session.


A possible compromise could be making soft resets fatal if user space enabled robustness for the context, and non-fatal if not.
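
Purely to illustrate that idea (this is not existing amdgpu code:
amdgpu_ctx_from_job() and ctx->robust_enabled below are invented names
standing in for "the context that submitted the job" and "userspace created
the context with robustness enabled"), such a policy could look roughly like
this in the job timeout handler:

	if (amdgpu_gpu_recovery &&
	    amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
		struct amdgpu_ctx *ctx = amdgpu_ctx_from_job(job); /* hypothetical helper */

		/* Soft recovery worked: only treat it as fatal for the context
		 * (blocking its further submissions) when userspace asked for
		 * robustness and can react to the notification. */
		if (ctx && ctx->robust_enabled) /* hypothetical flag */
			drm_sched_increase_karma(&job->base);

		DRM_ERROR("ring %s timeout, but soft recovered\n",
			  s_job->sched->name);
		goto exit;
	}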


> Am 25.04.23 um 13:07 schrieb Marek Olšák:
>> That supposedly depends on the compositor. There may be compositors for very specific cases (e.g. Steam Deck) that handle resets very well, and those would like to be properly notified of all resets because that's how they get the best outcome, e.g. no corruption. A soft reset that is unhandled by userspace may result in persistent corruption.
Christian König April 25, 2023, 12:44 p.m. UTC | #10
On 25.04.23 at 14:14, Michel Dänzer wrote:
> On 4/25/23 14:08, Christian König wrote:
>> Well signaling that something happened is not the question. We do this for both soft as well as hard resets.
>>
>> The question is if errors result in blocking further submissions with the same context or not.
>>
>> In case of a hard reset and potential loss of state we have to kill the context, otherwise a follow up submission would just lockup the hardware once more.
>>
>> In case of a soft reset I think we can keep the context alive, this way even applications without robustness handling can keep work.
>>
>> You potentially still get some corruption, but at least not your compositor killed.
> Right, and if there is corruption, the user can restart the session.
>
>
> Maybe a possible compromise could be making soft resets fatal if user space enabled robustness for the context, and non-fatal if not.

Well that should already be mostly the case. If an application has 
enabled robustness it should notice that something went wrong and act 
appropriately.

The only thing we need to handle is applications without robustness in 
the case of a hard reset; otherwise they will trigger a reset over and 
over again.

Christian.

>
>
>> Am 25.04.23 um 13:07 schrieb Marek Olšák:
>>> That supposedly depends on the compositor. There may be compositors for very specific cases (e.g. Steam Deck) that handle resets very well, and those would like to be properly notified of all resets because that's how they get the best outcome, e.g. no corruption. A soft reset that is unhandled by userspace may result in persistent corruption.
>
Marek Olšák April 25, 2023, 7:11 p.m. UTC | #11
The last 3 comments in this thread contain arguments that are false and
were specifically pointed out as false 6 comments ago: Soft resets are just
as fatal as hard resets. There is nothing better about soft resets. If the
VRAM is lost completely, that's a different story, and if the hard reset is
100% unreliable, that's also a different story, but other than those two
outliers, there is no difference between the two from the user's point of view.
Both can repeatedly hang if you don't prevent the app that caused the hang
from using the GPU even if the app is not robust. The robustness context
type doesn't matter here. By definition, no guilty app can continue after a
reset, and no innocent apps affected by a reset can continue either because
those can now hang too. That's how destructive all resets are. Personal
anecdotes that the soft reset is better are just that, anecdotes.

Marek

On Tue, Apr 25, 2023, 08:44 Christian König <christian.koenig@amd.com>
wrote:

> Am 25.04.23 um 14:14 schrieb Michel Dänzer:
> > On 4/25/23 14:08, Christian König wrote:
> >> Well signaling that something happened is not the question. We do this
> for both soft as well as hard resets.
> >>
> >> The question is if errors result in blocking further submissions with
> the same context or not.
> >>
> >> In case of a hard reset and potential loss of state we have to kill the
> context, otherwise a follow up submission would just lockup the hardware
> once more.
> >>
> >> In case of a soft reset I think we can keep the context alive, this way
> even applications without robustness handling can keep work.
> >>
> >> You potentially still get some corruption, but at least not your
> compositor killed.
> > Right, and if there is corruption, the user can restart the session.
> >
> >
> > Maybe a possible compromise could be making soft resets fatal if user
> space enabled robustness for the context, and non-fatal if not.
>
> Well that should already be mostly the case. If an application has
> enabled robustness it should notice that something went wrong and act
> appropriately.
>
> The only thing we need to handle is for applications without robustness
> in case of a hard reset or otherwise it will trigger an reset over and
> over again.
>
> Christian.
>
> >
> >
> >> Am 25.04.23 um 13:07 schrieb Marek Olšák:
> >>> That supposedly depends on the compositor. There may be compositors
> for very specific cases (e.g. Steam Deck) that handle resets very well, and
> those would like to be properly notified of all resets because that's how
> they get the best outcome, e.g. no corruption. A soft reset that is
> unhandled by userspace may result in persistent corruption.
> >
>
>
Michel Dänzer April 26, 2023, 9:51 a.m. UTC | #12
On 4/25/23 21:11, Marek Olšák wrote:
> The last 3 comments in this thread contain arguments that are false and were specifically pointed out as false 6 comments ago: Soft resets are just as fatal as hard resets. There is nothing better about soft resets. If the VRAM is lost completely, that's a different story, and if the hard reset is 100% unreliable, that's also a different story, but other than those two outliers, there is no difference between the two from the user point view. Both can repeatedly hang if you don't prevent the app that caused the hang from using the GPU even if the app is not robust. The robustness context type doesn't matter here. By definition, no guilty app can continue after a reset, and no innocent apps affected by a reset can continue either because those can now hang too. That's how destructive all resets are. Personal anecdotes that the soft reset is better are just that, anecdotes.

You're trying to frame the situation as black or white, but reality is shades of grey.


There's a similar situation with kernel Oopsen: In principle it's not safe to continue executing the kernel after it hits an Oops, since it might be in an inconsistent state, which could result in any kind of misbehaviour. Still, the default behaviour is to continue executing, and in most cases it turns out fine. Users which cannot accept the residual risk can choose to make the kernel panic when it hits an Oops (either via CONFIG_PANIC_ON_OOPS at build time, or via oops=panic on the kernel command line). A kernel panic means that the machine basically freezes from a user PoV, which would be worse as the default behaviour for most users (because it would e.g. incur a higher risk of losing filesystem data).
Marek Olšák April 26, 2023, 3:52 p.m. UTC | #13
Perhaps I should clarify this. There are GL and Vulkan features where, if
an app uses them and its shaders are killed, the next IB will hang. One of
them is Draw Indirect: if a shader is killed before storing the vertex
count and instance count in memory, the next draw will hang with high
probability. No such app can be allowed to continue executing after a reset.
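
To make the Draw Indirect case concrete, here is a hedged GL-flavoured
illustration (not taken from the thread; it assumes a loaded GL 4.3+ context
via libepoxy, and that args_bo is also bound as an SSBO inside
fill_args_compute so the shader can write to it):

#include <epoxy/gl.h>

/* A compute shader writes the DrawArraysIndirectCommand
 * {count, instanceCount, first, baseInstance} into args_bo, and the draw
 * right after consumes it. If that shader is killed by a soft recovery
 * before the store happens, the draw reads stale or garbage counts and
 * can hang the ring again. */
static void draw_with_gpu_generated_args(GLuint fill_args_compute, GLuint args_bo)
{
	glUseProgram(fill_args_compute);
	glDispatchCompute(1, 1, 1);              /* produce the draw arguments */
	glMemoryBarrier(GL_COMMAND_BARRIER_BIT); /* make them visible to the draw */

	glBindBuffer(GL_DRAW_INDIRECT_BUFFER, args_bo);
	glDrawArraysIndirect(GL_TRIANGLES, NULL); /* consume them at offset 0 */
}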

Marek

On Wed, Apr 26, 2023 at 5:51 AM Michel Dänzer <michel.daenzer@mailbox.org>
wrote:

> On 4/25/23 21:11, Marek Olšák wrote:
> > The last 3 comments in this thread contain arguments that are false and
> were specifically pointed out as false 6 comments ago: Soft resets are just
> as fatal as hard resets. There is nothing better about soft resets. If the
> VRAM is lost completely, that's a different story, and if the hard reset is
> 100% unreliable, that's also a different story, but other than those two
> outliers, there is no difference between the two from the user point view.
> Both can repeatedly hang if you don't prevent the app that caused the hang
> from using the GPU even if the app is not robust. The robustness context
> type doesn't matter here. By definition, no guilty app can continue after a
> reset, and no innocent apps affected by a reset can continue either because
> those can now hang too. That's how destructive all resets are. Personal
> anecdotes that the soft reset is better are just that, anecdotes.
>
> You're trying to frame the situation as black or white, but reality is
> shades of grey.
>
>
> There's a similar situation with kernel Oopsen: In principle it's not safe
> to continue executing the kernel after it hits an Oops, since it might be
> in an inconsistent state, which could result in any kind of misbehaviour.
> Still, the default behaviour is to continue executing, and in most cases it
> turns out fine. Users which cannot accept the residual risk can choose to
> make the kernel panic when it hits an Oops (either via CONFIG_PANIC_ON_OOPS
> at build time, or via oops=panic on the kernel command line). A kernel
> panic means that the machine basically freezes from a user PoV, which would
> be worse as the default behaviour for most users (because it would e.g.
> incur a higher risk of losing filesystem data).
>
>
> --
> Earthling Michel Dänzer            |                  https://redhat.com
> Libre software enthusiast          |         Mesa and Xwayland developer
>
>

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index ac78caa7cba8..ea169d1689e2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4771,9 +4771,6 @@  int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 
 	amdgpu_fence_driver_isr_toggle(adev, false);
 
-	if (job && job->vm)
-		drm_sched_increase_karma(&job->base);
-
 	r = amdgpu_reset_prepare_hwcontext(adev, reset_context);
 	/* If reset handler not implemented, continue; otherwise return */
 	if (r == -ENOSYS)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index c3d9d75143f4..097ed8f06865 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -51,6 +51,13 @@  static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 	memset(&ti, 0, sizeof(struct amdgpu_task_info));
 	adev->job_hang = true;
 
+	amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
+
+	if (job && job->vm) {
+		DRM_INFO("marking %s context as guilty", ti.process_name);
+		drm_sched_increase_karma(&job->base);
+	}
+
 	if (amdgpu_gpu_recovery &&
 	    amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) {
 		DRM_ERROR("ring %s timeout, but soft recovered\n",
@@ -58,7 +65,6 @@  static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
 		goto exit;
 	}
 
-	amdgpu_vm_get_task_info(ring->adev, job->pasid, &ti);
 	DRM_ERROR("ring %s timeout, signaled seq=%u, emitted seq=%u\n",
 		  job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
 		  ring->fence_drv.sync_seq);