
drm/i915: Clear pending reset requests during suspend

Message ID 1452768585-18661-1-git-send-email-arun.siluvery@linux.intel.com (mailing list archive)
State New, archived

Commit Message

arun.siluvery@linux.intel.com Jan. 14, 2016, 10:49 a.m. UTC
Pending reset requests are cleared before suspending; they should be picked
up again after resume when new work is submitted.

This was originally added as part of the Gen8 TDR patches from Tomas Elf,
which are under review. As suggested by Chris, it is extracted as a separate
patch since it can be useful on its own now.

Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

kernel test robot Jan. 14, 2016, 11:07 a.m. UTC | #1
Hi Arun,

[auto build test WARNING on drm-intel/for-linux-next]
[also build test WARNING on v4.4 next-20160114]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Arun-Siluvery/drm-i915-Clear-pending-reset-requests-during-suspend/20160114-185121
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: x86_64-randconfig-x010-01140842 (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   drivers/gpu/drm/i915/i915_drv.c: In function 'i915_drm_suspend':
>> drivers/gpu/drm/i915/i915_drv.c:601:2: warning: 'atomic_clear_mask' is deprecated [-Wdeprecated-declarations]
     atomic_clear_mask(I915_RESET_IN_PROGRESS_FLAG,
     ^
   In file included from include/linux/debug_locks.h:5:0,
                    from include/linux/lockdep.h:23,
                    from include/linux/spinlock_types.h:18,
                    from include/linux/mutex.h:15,
                    from include/linux/kernfs.h:13,
                    from include/linux/sysfs.h:15,
                    from include/linux/kobject.h:21,
                    from include/linux/device.h:17,
                    from drivers/gpu/drm/i915/i915_drv.c:30:
   include/linux/atomic.h:458:33: note: declared here
    static inline __deprecated void atomic_clear_mask(unsigned int mask, atomic_t *v)
                                    ^

vim +/atomic_clear_mask +601 drivers/gpu/drm/i915/i915_drv.c

   585	
   586		drm_kms_helper_poll_disable(dev);
   587	
   588		pci_save_state(dev->pdev);
   589	
   590		error = i915_gem_suspend(dev);
   591		if (error) {
   592			dev_err(&dev->pdev->dev,
   593				"GEM idle failed, resume might fail\n");
   594			goto out;
   595		}
   596	
   597		/*
   598		 * Clear any pending reset requests. They should be picked up
   599		 * after resume when new work is submitted
   600		 */
 > 601		atomic_clear_mask(I915_RESET_IN_PROGRESS_FLAG,
   602				  &dev_priv->gpu_error.reset_counter);
   603	
   604		intel_guc_suspend(dev);
   605	
   606		intel_suspend_gt_powersave(dev);
   607	
   608		/*
   609		 * Disable CRTCs directly since we want to preserve sw state

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Chris Wilson Jan. 14, 2016, 11:19 a.m. UTC | #2
On Thu, Jan 14, 2016 at 10:49:45AM +0000, Arun Siluvery wrote:
> Pending reset requests are cleared before suspending, they should be picked up
> after resume when new work is submitted.
> 
> This is originally added as part of TDR patches for Gen8 from Tomas Elf which
> are under review, as suggested by Chris this is extracted as a separate patch
> as it can be useful now.
> 
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index f17a2b0..09ed83e 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -594,6 +594,13 @@ static int i915_drm_suspend(struct drm_device *dev)
>  		goto out;
>  	}
>  
> +	/*
> +	 * Clear any pending reset requests. They should be picked up
> +	 * after resume when new work is submitted
> +	 */
> +	atomic_clear_mask(I915_RESET_IN_PROGRESS_FLAG,
> +			  &dev_priv->gpu_error.reset_counter);
> +

The comment is slightly wrong. When the error tasklet in progress sees
that the flag is unset, it returns (i.e. doesn't perform the reset).

This is ok, because we are putting the device into PCI_D3: we are powering
it down, which should be our ultimate reset. So no need for the reset on
resume. Except... we do need to clean up the bookkeeping. Hmm, so what
we need to do is actually flush the reset task and pretend it succeeded.
-Chris
Daniel Vetter Jan. 19, 2016, 12:09 p.m. UTC | #3
On Thu, Jan 14, 2016 at 10:49:45AM +0000, Arun Siluvery wrote:
> Pending reset requests are cleared before suspending, they should be picked up
> after resume when new work is submitted.
> 
> This is originally added as part of TDR patches for Gen8 from Tomas Elf which
> are under review, as suggested by Chris this is extracted as a separate patch
> as it can be useful now.
> 
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>

Pulling in the discussion we had from irc: Imo the right approach is to
simply wait for the gpu reset to finish its job. Since that could in turn
lead to a dead gpu (if we're unlucky and init_hw failed) we'd need to do
that in a loop around gem_idle, and drop dev->struct_mutex in-between.
E.g.

while (busy) {
	mutex_lock();
	gpu_idle();
	mutex_unlock();

	flush_work(reset_work);
}

Cheers, Daniel
> ---
>  drivers/gpu/drm/i915/i915_drv.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index f17a2b0..09ed83e 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -594,6 +594,13 @@ static int i915_drm_suspend(struct drm_device *dev)
>  		goto out;
>  	}
>  
> +	/*
> +	 * Clear any pending reset requests. They should be picked up
> +	 * after resume when new work is submitted
> +	 */
> +	atomic_clear_mask(I915_RESET_IN_PROGRESS_FLAG,
> +			  &dev_priv->gpu_error.reset_counter);
> +
>  	intel_guc_suspend(dev);
>  
>  	intel_suspend_gt_powersave(dev);
> -- 
> 1.9.1
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
Chris Wilson Jan. 19, 2016, 1:48 p.m. UTC | #4
On Tue, Jan 19, 2016 at 01:09:28PM +0100, Daniel Vetter wrote:
> On Thu, Jan 14, 2016 at 10:49:45AM +0000, Arun Siluvery wrote:
> > Pending reset requests are cleared before suspending, they should be picked up
> > after resume when new work is submitted.
> > 
> > This is originally added as part of TDR patches for Gen8 from Tomas Elf which
> > are under review, as suggested by Chris this is extracted as a separate patch
> > as it can be useful now.
> > 
> > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> 
> Pulling in the discussion we had from irc: Imo the right approach is to
> simply wait for gpu reset to finish it's job. Since that could in turn
> lead to a dead gpu (if we're unlucky and init_hw failed) we'd need to do
> that in a loop around gem_idle. And drop dev->struct_mutex in-between.
> E.g.
> 
> while (busy) {
> 	mutex_lock();
> 	gpu_idle();
> 	mutex_unlock();
> 
> 	flush_work(reset_work);
> }

Where does the requirement for gpu_idle come from? If there is a global
reset in progress, it cannot queue a request to flush the work and
waiting on the old results will be skipped. So just wait for the global
reset to complete, i.e. flush_work().
-Chris
Daniel Vetter Jan. 19, 2016, 2:04 p.m. UTC | #5
On Tue, Jan 19, 2016 at 01:48:05PM +0000, Chris Wilson wrote:
> On Tue, Jan 19, 2016 at 01:09:28PM +0100, Daniel Vetter wrote:
> > On Thu, Jan 14, 2016 at 10:49:45AM +0000, Arun Siluvery wrote:
> > > Pending reset requests are cleared before suspending, they should be picked up
> > > after resume when new work is submitted.
> > > 
> > > This is originally added as part of TDR patches for Gen8 from Tomas Elf which
> > > are under review, as suggested by Chris this is extracted as a separate patch
> > > as it can be useful now.
> > > 
> > > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> > 
> > Pulling in the discussion we had from irc: Imo the right approach is to
> > simply wait for gpu reset to finish it's job. Since that could in turn
> > lead to a dead gpu (if we're unlucky and init_hw failed) we'd need to do
> > that in a loop around gem_idle. And drop dev->struct_mutex in-between.
> > E.g.
> > 
> > while (busy) {
> > 	mutex_lock();
> > 	gpu_idle();
> > 	mutex_unlock();
> > 
> > 	flush_work(reset_work);
> > }
> 
> Where does the requirement for gpu_idle come from? If there is a global
> reset in progress, it cannot queue a request to flush the work and
> waiting on the old results will be skipped. So just wait for the global
> reset to complete, i.e. flush_work().

Yes, but the global reset might in turn leave a wrecked gpu behind, or at
least a non-idle one. Hence another gpu_idle on top, to make sure. If we
change init_hw() of engines to be synchronous then we should have at least
a WARN_ON(not_idle_but_i_expected_so()); in there ...
-Daniel
Chris Wilson Jan. 19, 2016, 2:13 p.m. UTC | #6
On Tue, Jan 19, 2016 at 03:04:40PM +0100, Daniel Vetter wrote:
> On Tue, Jan 19, 2016 at 01:48:05PM +0000, Chris Wilson wrote:
> > On Tue, Jan 19, 2016 at 01:09:28PM +0100, Daniel Vetter wrote:
> > > On Thu, Jan 14, 2016 at 10:49:45AM +0000, Arun Siluvery wrote:
> > > > Pending reset requests are cleared before suspending, they should be picked up
> > > > after resume when new work is submitted.
> > > > 
> > > > This is originally added as part of TDR patches for Gen8 from Tomas Elf which
> > > > are under review, as suggested by Chris this is extracted as a separate patch
> > > > as it can be useful now.
> > > > 
> > > > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > > Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> > > 
> > > Pulling in the discussion we had from irc: Imo the right approach is to
> > > simply wait for gpu reset to finish it's job. Since that could in turn
> > > lead to a dead gpu (if we're unlucky and init_hw failed) we'd need to do
> > > that in a loop around gem_idle. And drop dev->struct_mutex in-between.
> > > E.g.
> > > 
> > > while (busy) {
> > > 	mutex_lock();
> > > 	gpu_idle();
> > > 	mutex_unlock();
> > > 
> > > 	flush_work(reset_work);
> > > }
> > 
> > Where does the requirement for gpu_idle come from? If there is a global
> > reset in progress, it cannot queue a request to flush the work and
> > waiting on the old results will be skipped. So just wait for the global
> > reset to complete, i.e. flush_work().
> 
> Yes, but the global reset might in turn leave a wrecked gpu behind, or at
> least a non-idle one. Hence another gpu_idle on top, to make sure. If we
> change init_hw() of engines to be synchronous then we should have at least
> a WARN_ON(not_idle_but_i_expected_so()); in there ...

Does it matter on suspend? We test on resume if the GPU is usable, but
if we wanted to test on suspend then we should do

flush_work();
if (i915_terminally_wedged())
   /* oh noes */;
-Chris
arun.siluvery@linux.intel.com Jan. 19, 2016, 3:04 p.m. UTC | #7
On 19/01/2016 14:13, Chris Wilson wrote:
> On Tue, Jan 19, 2016 at 03:04:40PM +0100, Daniel Vetter wrote:
>> On Tue, Jan 19, 2016 at 01:48:05PM +0000, Chris Wilson wrote:
>>> On Tue, Jan 19, 2016 at 01:09:28PM +0100, Daniel Vetter wrote:
>>>> On Thu, Jan 14, 2016 at 10:49:45AM +0000, Arun Siluvery wrote:
>>>>> Pending reset requests are cleared before suspending, they should be picked up
>>>>> after resume when new work is submitted.
>>>>>
>>>>> This is originally added as part of TDR patches for Gen8 from Tomas Elf which
>>>>> are under review, as suggested by Chris this is extracted as a separate patch
>>>>> as it can be useful now.
>>>>>
>>>>> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
>>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>>>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>>>>
>>>> Pulling in the discussion we had from irc: Imo the right approach is to
>>>> simply wait for gpu reset to finish it's job. Since that could in turn
>>>> lead to a dead gpu (if we're unlucky and init_hw failed) we'd need to do
>>>> that in a loop around gem_idle. And drop dev->struct_mutex in-between.
>>>> E.g.
>>>>
>>>> while (busy) {
>>>> 	mutex_lock();
>>>> 	gpu_idle();
>>>> 	mutex_unlock();
>>>>
>>>> 	flush_work(reset_work);
>>>> }
>>>
>>> Where does the requirement for gpu_idle come from? If there is a global
>>> reset in progress, it cannot queue a request to flush the work and
>>> waiting on the old results will be skipped. So just wait for the global
>>> reset to complete, i.e. flush_work().
>>
>> Yes, but the global reset might in turn leave a wrecked gpu behind, or at
>> least a non-idle one. Hence another gpu_idle on top, to make sure. If we
>> change init_hw() of engines to be synchronous then we should have at least
>> a WARN_ON(not_idle_but_i_expected_so()); in there ...

gpu_error.work was removed in b8d24a06568368076ebd5a858a011699a97bfa42;
we now do the reset in the hangcheck work itself, so I think there is no
need to flush a separate work item.

while (i915_reset_in_progress(gpu_error) &&
       !i915_terminally_wedged(gpu_error)) {
	int ret;

	mutex_lock(&dev->struct_mutex);
	ret = i915_gpu_idle(dev);
	if (ret)
		DRM_ERROR("GPU is in inconsistent state after reset\n");
	mutex_unlock(&dev->struct_mutex);
}

If the reset is successful we are idle before suspend; otherwise we are in
a wedged state. Is this ok?

regards
Arun

>
> Does it matter on suspend? We test on resume if the GPU is usable, but
> if we wanted to test on suspend then we should do
>
> flush_work();
> if (i915_terminally_wedged())
>     /* oh noes */;
> -Chris
>
Daniel Vetter Jan. 19, 2016, 4:42 p.m. UTC | #8
On Tue, Jan 19, 2016 at 03:04:09PM +0000, Arun Siluvery wrote:
> On 19/01/2016 14:13, Chris Wilson wrote:
> >On Tue, Jan 19, 2016 at 03:04:40PM +0100, Daniel Vetter wrote:
> >>On Tue, Jan 19, 2016 at 01:48:05PM +0000, Chris Wilson wrote:
> >>>On Tue, Jan 19, 2016 at 01:09:28PM +0100, Daniel Vetter wrote:
> >>>>On Thu, Jan 14, 2016 at 10:49:45AM +0000, Arun Siluvery wrote:
> >>>>>Pending reset requests are cleared before suspending, they should be picked up
> >>>>>after resume when new work is submitted.
> >>>>>
> >>>>>This is originally added as part of TDR patches for Gen8 from Tomas Elf which
> >>>>>are under review, as suggested by Chris this is extracted as a separate patch
> >>>>>as it can be useful now.
> >>>>>
> >>>>>Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> >>>>>Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >>>>>Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> >>>>
> >>>>Pulling in the discussion we had from irc: Imo the right approach is to
> >>>>simply wait for gpu reset to finish it's job. Since that could in turn
> >>>>lead to a dead gpu (if we're unlucky and init_hw failed) we'd need to do
> >>>>that in a loop around gem_idle. And drop dev->struct_mutex in-between.
> >>>>E.g.
> >>>>
> >>>>while (busy) {
> >>>>	mutex_lock();
> >>>>	gpu_idle();
> >>>>	mutex_unlock();
> >>>>
> >>>>	flush_work(reset_work);
> >>>>}
> >>>
> >>>Where does the requirement for gpu_idle come from? If there is a global
> >>>reset in progress, it cannot queue a request to flush the work and
> >>>waiting on the old results will be skipped. So just wait for the global
> >>>reset to complete, i.e. flush_work().
> >>
> >>Yes, but the global reset might in turn leave a wrecked gpu behind, or at
> >>least a non-idle one. Hence another gpu_idle on top, to make sure. If we
> >>change init_hw() of engines to be synchronous then we should have at least
> >>a WARN_ON(not_idle_but_i_expected_so()); in there ...
> 
> gpu_error.work is removed in b8d24a06568368076ebd5a858a011699a97bfa42, we

A git sha1 from your private tree is meaningless in public. Either link
to some git web URL or to a mailing list archive.

Thanks, Daniel

> are doing reset in hangcheck work itself so I think there is no need to
> flush work.
> 
> while (i915_reset_in_progress(gpu_error) &&
>        !i915_terminally_wedged(gpu_error)) {
>         int ret;
> 
>         mutex_lock(&dev->struct_mutex);
>         ret = i915_gpu_idle(dev);
>         if (ret)
>                 DRM_ERROR("GPU is in inconsistent state after reset\n");
>         mutex_unlock(&dev->struct_mutex);
> }
> 
> If the reset is successful we are idle before suspend otherwise in a wedged
> state. is this ok?
> 
> regards
> Arun
> 
> >
> >Does it matter on suspend? We test on resume if the GPU is usable, but
> >if we wanted to test on suspend then we should do
> >
> >flush_work();
> >if (i915_terminally_wedged())
> >    /* oh noes */;
> >-Chris
> >
>
arun.siluvery@linux.intel.com Jan. 19, 2016, 5:01 p.m. UTC | #9
On 19/01/2016 16:42, Daniel Vetter wrote:
> On Tue, Jan 19, 2016 at 03:04:09PM +0000, Arun Siluvery wrote:
>> On 19/01/2016 14:13, Chris Wilson wrote:
>>> On Tue, Jan 19, 2016 at 03:04:40PM +0100, Daniel Vetter wrote:
>>>> On Tue, Jan 19, 2016 at 01:48:05PM +0000, Chris Wilson wrote:
>>>>> On Tue, Jan 19, 2016 at 01:09:28PM +0100, Daniel Vetter wrote:
>>>>>> On Thu, Jan 14, 2016 at 10:49:45AM +0000, Arun Siluvery wrote:
>>>>>>> Pending reset requests are cleared before suspending, they should be picked up
>>>>>>> after resume when new work is submitted.
>>>>>>>
>>>>>>> This is originally added as part of TDR patches for Gen8 from Tomas Elf which
>>>>>>> are under review, as suggested by Chris this is extracted as a separate patch
>>>>>>> as it can be useful now.
>>>>>>>
>>>>>>> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
>>>>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>>>>>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>>>>>>
>>>>>> Pulling in the discussion we had from irc: Imo the right approach is to
>>>>>> simply wait for gpu reset to finish it's job. Since that could in turn
>>>>>> lead to a dead gpu (if we're unlucky and init_hw failed) we'd need to do
>>>>>> that in a loop around gem_idle. And drop dev->struct_mutex in-between.
>>>>>> E.g.
>>>>>>
>>>>>> while (busy) {
>>>>>> 	mutex_lock();
>>>>>> 	gpu_idle();
>>>>>> 	mutex_unlock();
>>>>>>
>>>>>> 	flush_work(reset_work);
>>>>>> }
>>>>>
>>>>> Where does the requirement for gpu_idle come from? If there is a global
>>>>> reset in progress, it cannot queue a request to flush the work and
>>>>> waiting on the old results will be skipped. So just wait for the global
>>>>> reset to complete, i.e. flush_work().
>>>>
>>>> Yes, but the global reset might in turn leave a wrecked gpu behind, or at
>>>> least a non-idle one. Hence another gpu_idle on top, to make sure. If we
>>>> change init_hw() of engines to be synchronous then we should have at least
>>>> a WARN_ON(not_idle_but_i_expected_so()); in there ...
>>
>> gpu_error.work is removed in b8d24a06568368076ebd5a858a011699a97bfa42, we
>
> git sha1 from your private tree are meaningless in the public. Either link
> to some git weburl or mailing lists archive link.

It is from drm-intel repo,
http://cgit.freedesktop.org/drm-intel/commit/?id=b8d24a06568368076ebd5a858a011699a97bfa42

http://lists.freedesktop.org/archives/intel-gfx/2015-January/059154.html

regards
Arun

>
> Thanks, Daniel
>
>> are doing reset in hangcheck work itself so I think there is no need to
>> flush work.
>>
>> while (i915_reset_in_progress(gpu_error) &&
>>         !i915_terminally_wedged(gpu_error)) {
>>          int ret;
>>
>>          mutex_lock(&dev->struct_mutex);
>>          ret = i915_gpu_idle(dev);
>>          if (ret)
>>                  DRM_ERROR("GPU is in inconsistent state after reset\n");
>>          mutex_unlock(&dev->struct_mutex);
>> }
>>
>> If the reset is successful we are idle before suspend otherwise in a wedged
>> state. is this ok?
>>
>> regards
>> Arun
>>
>>>
>>> Does it matter on suspend? We test on resume if the GPU is usable, but
>>> if we wanted to test on suspend then we should do
>>>
>>> flush_work();
>>> if (i915_terminally_wedged())
>>>     /* oh noes */;
>>> -Chris
>>>
>>
>
Daniel Vetter Jan. 19, 2016, 5:18 p.m. UTC | #10
On Tue, Jan 19, 2016 at 05:01:00PM +0000, Arun Siluvery wrote:
> On 19/01/2016 16:42, Daniel Vetter wrote:
> >On Tue, Jan 19, 2016 at 03:04:09PM +0000, Arun Siluvery wrote:
> >>On 19/01/2016 14:13, Chris Wilson wrote:
> >>>On Tue, Jan 19, 2016 at 03:04:40PM +0100, Daniel Vetter wrote:
> >>>>On Tue, Jan 19, 2016 at 01:48:05PM +0000, Chris Wilson wrote:
> >>>>>On Tue, Jan 19, 2016 at 01:09:28PM +0100, Daniel Vetter wrote:
> >>>>>>On Thu, Jan 14, 2016 at 10:49:45AM +0000, Arun Siluvery wrote:
> >>>>>>>Pending reset requests are cleared before suspending, they should be picked up
> >>>>>>>after resume when new work is submitted.
> >>>>>>>
> >>>>>>>This is originally added as part of TDR patches for Gen8 from Tomas Elf which
> >>>>>>>are under review, as suggested by Chris this is extracted as a separate patch
> >>>>>>>as it can be useful now.
> >>>>>>>
> >>>>>>>Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> >>>>>>>Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >>>>>>>Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> >>>>>>
> >>>>>>Pulling in the discussion we had from irc: Imo the right approach is to
> >>>>>>simply wait for gpu reset to finish it's job. Since that could in turn
> >>>>>>lead to a dead gpu (if we're unlucky and init_hw failed) we'd need to do
> >>>>>>that in a loop around gem_idle. And drop dev->struct_mutex in-between.
> >>>>>>E.g.
> >>>>>>
> >>>>>>while (busy) {
> >>>>>>	mutex_lock();
> >>>>>>	gpu_idle();
> >>>>>>	mutex_unlock();
> >>>>>>
> >>>>>>	flush_work(reset_work);
> >>>>>>}
> >>>>>
> >>>>>Where does the requirement for gpu_idle come from? If there is a global
> >>>>>reset in progress, it cannot queue a request to flush the work and
> >>>>>waiting on the old results will be skipped. So just wait for the global
> >>>>>reset to complete, i.e. flush_work().
> >>>>
> >>>>Yes, but the global reset might in turn leave a wrecked gpu behind, or at
> >>>>least a non-idle one. Hence another gpu_idle on top, to make sure. If we
> >>>>change init_hw() of engines to be synchronous then we should have at least
> >>>>a WARN_ON(not_idle_but_i_expected_so()); in there ...
> >>
> >>gpu_error.work is removed in b8d24a06568368076ebd5a858a011699a97bfa42, we
> >
> >git sha1 from your private tree are meaningless in the public. Either link
> >to some git weburl or mailing lists archive link.
> 
> It is from drm-intel repo,
> http://cgit.freedesktop.org/drm-intel/commit/?id=b8d24a06568368076ebd5a858a011699a97bfa42
> 
> http://lists.freedesktop.org/archives/intel-gfx/2015-January/059154.html

Oh right, forgot that this landed, sorry for the confusion.

Summary of our irc discussion: We idle the gpu and flush the hangcheck
(which should flush the reset work), so at least with current upstream
there shouldn't be a bug. If there is a bug we need to understand it; we
can't just add code without clear explanation and reasons: at best that
confuses, at worst it hides some real bugs.
-Daniel

> 
> regards
> Arun
> 
> >
> >Thanks, Daniel
> >
> >>are doing reset in hangcheck work itself so I think there is no need to
> >>flush work.
> >>
> >>while (i915_reset_in_progress(gpu_error) &&
> >>        !i915_terminally_wedged(gpu_error)) {
> >>         int ret;
> >>
> >>         mutex_lock(&dev->struct_mutex);
> >>         ret = i915_gpu_idle(dev);
> >>         if (ret)
> >>                 DRM_ERROR("GPU is in inconsistent state after reset\n");
> >>         mutex_unlock(&dev->struct_mutex);
> >>}
> >>
> >>If the reset is successful we are idle before suspend otherwise in a wedged
> >>state. is this ok?
> >>
> >>regards
> >>Arun
> >>
> >>>
> >>>Does it matter on suspend? We test on resume if the GPU is usable, but
> >>>if we wanted to test on suspend then we should do
> >>>
> >>>flush_work();
> >>>if (i915_terminally_wedged())
> >>>    /* oh noes */;
> >>>-Chris
> >>>
> >>
> >
>

Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index f17a2b0..09ed83e 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -594,6 +594,13 @@  static int i915_drm_suspend(struct drm_device *dev)
 		goto out;
 	}
 
+	/*
+	 * Clear any pending reset requests. They should be picked up
+	 * after resume when new work is submitted
+	 */
+	atomic_clear_mask(I915_RESET_IN_PROGRESS_FLAG,
+			  &dev_priv->gpu_error.reset_counter);
+
 	intel_guc_suspend(dev);
 
 	intel_suspend_gt_powersave(dev);