drm/i915/gt: Clear wedged status upon suspend

Message ID 20230124110515.17017-1-nirmoy.das@intel.com (mailing list archive)
State New, archived

Commit Message

Nirmoy Das Jan. 24, 2023, 11:05 a.m. UTC
From: Chris Wilson <chris.p.wilson@linux.intel.com>

Currently we use set-wedged on suspend if the workload is not responding
in order to allow a fast suspend (albeit at the cost of discarding the
current userspace). This may leave the device wedged during suspend,
where we may require the device available in order to swapout CPU
inaccessible device memory. Clear any temporary wedged-status after
flushing userspace off the device so we can use the blitter ourselves
inside suspend.

Testcase: igt/gem_eio/in-flight-suspend
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Signed-off-by: Chris Wilson <chris.p.wilson@linux.intel.com>
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_gt_pm.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

Comments

Nirmoy Das Jan. 24, 2023, 11:07 a.m. UTC | #1
Forgot to add the drm issue as a reference.

On 1/24/2023 12:05 PM, Nirmoy Das wrote:
> From: Chris Wilson <chris.p.wilson@linux.intel.com>
>
> Currently we use set-wedged on suspend if the workload is not responding
> in order to allow a fast suspend (albeit at the cost of discarding the
> current userspace). This may leave the device wedged during suspend,
> where we may require the device available in order to swapout CPU
> inaccessible device memory. Clear any temporary wedged-status after
> flushing userspace off the device so we can use the blitter ourselves
> inside suspend.
>
> Testcase: igt/gem_eio/in-flight-suspend
References: https://gitlab.freedesktop.org/drm/intel/-/issues/7896
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Signed-off-by: Chris Wilson <chris.p.wilson@linux.intel.com>
> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_gt_pm.c | 10 ++++------
>   1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> index cef3d6f5c34e..74d1dd3793f9 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> @@ -317,19 +317,17 @@ int intel_gt_resume(struct intel_gt *gt)
>   
>   static void wait_for_suspend(struct intel_gt *gt)
>   {
> -	if (!intel_gt_pm_is_awake(gt))
> -		return;
> -
> -	if (intel_gt_wait_for_idle(gt, I915_GT_SUSPEND_IDLE_TIMEOUT) == -ETIME) {
> +	if (intel_gt_wait_for_idle(gt, I915_GT_SUSPEND_IDLE_TIMEOUT) == -ETIME)
>   		/*
>   		 * Forcibly cancel outstanding work and leave
>   		 * the gpu quiet.
>   		 */
>   		intel_gt_set_wedged(gt);
> -		intel_gt_retire_requests(gt);
> -	}
>   
>   	intel_gt_pm_wait_for_idle(gt);
> +
> +	/* Make the GPU available again for swapout */
> +	intel_gt_unset_wedged(gt);
>   }
>   
>   void intel_gt_suspend_prepare(struct intel_gt *gt)
Rodrigo Vivi Jan. 24, 2023, 7:26 p.m. UTC | #2
On Tue, Jan 24, 2023 at 12:07:19PM +0100, Das, Nirmoy wrote:
> Forgot to add the drm issue as a reference.
> 
> On 1/24/2023 12:05 PM, Nirmoy Das wrote:
> > From: Chris Wilson <chris.p.wilson@linux.intel.com>
> > 
> > Currently we use set-wedged on suspend if the workload is not responding
> > in order to allow a fast suspend (albeit at the cost of discarding the
> > current userspace). This may leave the device wedged during suspend,
> > where we may require the device available in order to swapout CPU
> > inaccessible device memory. Clear any temporary wedged-status after
> > flushing userspace off the device so we can use the blitter ourselves
> > inside suspend.

This seems a very good move. But this explains the unset_wedged part,
not the removal of the retire_requests. Why don't we need to retire them
anymore?

Also, what are the chances of races here? I mean, we are marking
the gpu as not wedged anymore. Do we have any guarantee at this point
that no further requests will arrive?

Shouldn't we have a way to differentiate between the totally wedged
and blocked for user submission?

> > 
> > Testcase: igt/gem_eio/in-flight-suspend
> References: https://gitlab.freedesktop.org/drm/intel/-/issues/7896
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Signed-off-by: Chris Wilson <chris.p.wilson@linux.intel.com>
> > Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_gt_pm.c | 10 ++++------
> >   1 file changed, 4 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > index cef3d6f5c34e..74d1dd3793f9 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> > @@ -317,19 +317,17 @@ int intel_gt_resume(struct intel_gt *gt)
> >   static void wait_for_suspend(struct intel_gt *gt)
> >   {
> > -	if (!intel_gt_pm_is_awake(gt))
> > -		return;
> > -
> > -	if (intel_gt_wait_for_idle(gt, I915_GT_SUSPEND_IDLE_TIMEOUT) == -ETIME) {
> > +	if (intel_gt_wait_for_idle(gt, I915_GT_SUSPEND_IDLE_TIMEOUT) == -ETIME)
> >   		/*
> >   		 * Forcibly cancel outstanding work and leave
> >   		 * the gpu quiet.
> >   		 */
> >   		intel_gt_set_wedged(gt);
> > -		intel_gt_retire_requests(gt);
> > -	}
> >   	intel_gt_pm_wait_for_idle(gt);
> > +
> > +	/* Make the GPU available again for swapout */
> > +	intel_gt_unset_wedged(gt);
> >   }
> >   void intel_gt_suspend_prepare(struct intel_gt *gt)
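
To make the last question above concrete, one way to separate "totally wedged" from "blocked for user submission" is a three-state flag instead of a single wedged bit. The sketch below is a user-space model only, not i915 code: every `mock_` name is hypothetical and invented for illustration, and the choice of -EIO as the rejection error is an assumption of the sketch.

```c
#include <errno.h>

/* Hypothetical states for illustration; i915 does not define these. */
enum mock_gt_state {
	MOCK_GT_OK,		/* accepts every submission */
	MOCK_GT_USER_BLOCKED,	/* kernel-internal work only, e.g. swapout blits */
	MOCK_GT_WEDGED,		/* nothing can be submitted */
};

/* Hypothetical origin of a request. */
enum mock_client {
	MOCK_CLIENT_USER,
	MOCK_CLIENT_KERNEL,
};

/* Return 0 if a request from @client may be queued, -EIO otherwise. */
static int mock_gt_submit(enum mock_gt_state state, enum mock_client client)
{
	switch (state) {
	case MOCK_GT_OK:
		return 0;
	case MOCK_GT_USER_BLOCKED:
		/* Only the driver itself may still use the engines. */
		return client == MOCK_CLIENT_KERNEL ? 0 : -EIO;
	case MOCK_GT_WEDGED:
	default:
		return -EIO;
	}
}
```

With such a split, the suspend path could move to the user-blocked state after flushing userspace rather than clearing the wedged status entirely, which is the distinction the question asks about.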
Nirmoy Das Jan. 25, 2023, 1:28 p.m. UTC | #3
Hi Rodrigo,

On 1/24/2023 8:26 PM, Rodrigo Vivi wrote:
> On Tue, Jan 24, 2023 at 12:07:19PM +0100, Das, Nirmoy wrote:
>> Forgot to add the drm issue as a reference.
>>
>> On 1/24/2023 12:05 PM, Nirmoy Das wrote:
>>> From: Chris Wilson <chris.p.wilson@linux.intel.com>
>>>
>>> Currently we use set-wedged on suspend if the workload is not responding
>>> in order to allow a fast suspend (albeit at the cost of discarding the
>>> current userspace). This may leave the device wedged during suspend,
>>> where we may require the device available in order to swapout CPU
>>> inaccessible device memory. Clear any temporary wedged-status after
>>> flushing userspace off the device so we can use the blitter ourselves
>>> inside suspend.
> This seems a very good move. But this explains the unset_wedged part,
> not the removal of the retire_requests. Why don't we need to retire them
> anymore?


Thanks for noticing that. This is on me: I missed another patch, which moved
the intel_gt_retire_requests() call inside intel_gt_set_wedged().

>
> Also, what are the chances of races here? I mean, we are marking
> the gpu as not wedged anymore. Do we have any guarantee at this point
> that no further requests will arrive?


The assumption was that this runs in the single-threaded suspend "context", so
we should be fine, but we just realized that this is called at pm prepare
time. Thanks for raising this. It seems I need to refactor
i915_gem_backup_suspend() as well, which should be called much later on.


Regards,

Nirmoy

>
> Shouldn't we have a way to differentiate between the totally wedged
> and blocked for user submission?
>
>>> Testcase: igt/gem_eio/in-flight-suspend
>> References: https://gitlab.freedesktop.org/drm/intel/-/issues/7896
>>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> Signed-off-by: Chris Wilson <chris.p.wilson@linux.intel.com>
>>> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/gt/intel_gt_pm.c | 10 ++++------
>>>    1 file changed, 4 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>>> index cef3d6f5c34e..74d1dd3793f9 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>>> @@ -317,19 +317,17 @@ int intel_gt_resume(struct intel_gt *gt)
>>>    static void wait_for_suspend(struct intel_gt *gt)
>>>    {
>>> -	if (!intel_gt_pm_is_awake(gt))
>>> -		return;
>>> -
>>> -	if (intel_gt_wait_for_idle(gt, I915_GT_SUSPEND_IDLE_TIMEOUT) == -ETIME) {
>>> +	if (intel_gt_wait_for_idle(gt, I915_GT_SUSPEND_IDLE_TIMEOUT) == -ETIME)
>>>    		/*
>>>    		 * Forcibly cancel outstanding work and leave
>>>    		 * the gpu quiet.
>>>    		 */
>>>    		intel_gt_set_wedged(gt);
>>> -		intel_gt_retire_requests(gt);
>>> -	}
>>>    	intel_gt_pm_wait_for_idle(gt);
>>> +
>>> +	/* Make the GPU available again for swapout */
>>> +	intel_gt_unset_wedged(gt);
>>>    }
>>>    void intel_gt_suspend_prepare(struct intel_gt *gt)
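
The ordering concern in this reply can be modelled in a few lines of user-space C. This is a sketch under stated assumptions, not driver code: `struct mock_gt` and all `mock_` functions are hypothetical, and the model is sequential, so it shows only the window that opens once the wedged status is cleared at prepare time, not a real concurrent race.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the GT state relevant to the discussion. */
struct mock_gt {
	bool wedged;
	int queued;	/* requests currently accepted and pending */
};

/* A wedged device rejects new work; otherwise the request is queued. */
static int mock_submit(struct mock_gt *gt)
{
	if (gt->wedged)
		return -EIO;
	gt->queued++;
	return 0;
}

/* Prepare phase as in the patch: wedge, drain, then unwedge. */
static void mock_suspend_prepare(struct mock_gt *gt)
{
	gt->wedged = true;	/* cancel outstanding work */
	gt->queued = 0;		/* device is now idle */
	gt->wedged = false;	/* reopen the device for swapout */
}

/* Number of requests pending when a later suspend phase would start. */
static int mock_requests_before_late_suspend(void)
{
	struct mock_gt gt = { .wedged = false, .queued = 3 };

	mock_suspend_prepare(&gt);

	/*
	 * Assumption of the model: some submitter other than the suspend
	 * path is still active between prepare and the later phases.
	 */
	mock_submit(&gt);

	return gt.queued;
}
```

In the model the device is no longer idle by the time the later phase begins, which is why doing the backup step much later, as proposed above for i915_gem_backup_suspend(), changes what the unwedge has to guarantee.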
Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index cef3d6f5c34e..74d1dd3793f9 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -317,19 +317,17 @@  int intel_gt_resume(struct intel_gt *gt)
 
 static void wait_for_suspend(struct intel_gt *gt)
 {
-	if (!intel_gt_pm_is_awake(gt))
-		return;
-
-	if (intel_gt_wait_for_idle(gt, I915_GT_SUSPEND_IDLE_TIMEOUT) == -ETIME) {
+	if (intel_gt_wait_for_idle(gt, I915_GT_SUSPEND_IDLE_TIMEOUT) == -ETIME)
 		/*
 		 * Forcibly cancel outstanding work and leave
 		 * the gpu quiet.
 		 */
 		intel_gt_set_wedged(gt);
-		intel_gt_retire_requests(gt);
-	}
 
 	intel_gt_pm_wait_for_idle(gt);
+
+	/* Make the GPU available again for swapout */
+	intel_gt_unset_wedged(gt);
 }
 
 void intel_gt_suspend_prepare(struct intel_gt *gt)
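
For readers following the control flow, the patched wait_for_suspend() can be approximated by a small user-space model. Everything here is a hypothetical stand-in: the `mock_` functions only mimic the roles of the intel_gt_* calls in the diff, and folding the retire step into `mock_set_wedged()` follows the explanation given in comment #3 rather than any code shown in this thread.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the state touched by wait_for_suspend(). */
struct mock_gt {
	bool wedged;
	bool hung;	/* workload never completes on its own */
	int in_flight;	/* outstanding requests */
};

/* Stand-in for the timed idle wait: times out on a hung workload. */
static int mock_wait_for_idle(struct mock_gt *gt)
{
	if (gt->hung && gt->in_flight > 0)
		return -ETIME;
	gt->in_flight = 0;
	return 0;
}

/* Stand-in for set-wedged: cancels and retires outstanding work. */
static void mock_set_wedged(struct mock_gt *gt)
{
	gt->wedged = true;
	gt->in_flight = 0;
}

/* Stand-in for unset-wedged: makes the device usable again. */
static void mock_unset_wedged(struct mock_gt *gt)
{
	gt->wedged = false;
}

/* Same shape as the patched function: wedge on timeout, always unwedge. */
static void mock_wait_for_suspend(struct mock_gt *gt)
{
	if (mock_wait_for_idle(gt) == -ETIME)
		mock_set_wedged(gt);

	/* Make the device available again for swapout. */
	mock_unset_wedged(gt);
}
```

Whether or not the idle wait timed out, the model leaves the device idle and not wedged, matching the intent stated in the commit message of using the blitter inside suspend.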