[v6,3/9] drm/i915/gt: Increase suspend timeout

Message ID 20210922062527.865433-4-thomas.hellstrom@linux.intel.com (mailing list archive)
State New, archived
Series drm/i915: Suspend / resume backup- and restore of LMEM.

Commit Message

Thomas Hellström Sept. 22, 2021, 6:25 a.m. UTC
With GuC submission on DG1, the execution of the requests times out
for the gem_exec_suspend igt test case after executing around 800-900
of 1000 submitted requests.

Given the time we allow elsewhere for fences to signal (in the order of
seconds), increase the timeout before we mark the gt wedged and proceed.

Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
---
 drivers/gpu/drm/i915/gt/intel_gt_pm.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)
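
For a rough sense of scale, here is a minimal user-space sketch of what the
bump means in wall-clock terms. It assumes the previous I915_GEM_IDLE_TIMEOUT
was (HZ / 5); that value is not stated in this patch and is taken from memory
of the i915 tree of that era, so treat it as an assumption.

#include <stdio.h>

int main(void)
{
	/* Pure jiffies-to-milliseconds arithmetic; the HZ values are just
	 * common CONFIG_HZ choices used for illustration.
	 */
	const int hz_values[] = { 100, 250, 300, 1000 };

	for (unsigned int i = 0; i < sizeof(hz_values) / sizeof(hz_values[0]); i++) {
		int hz = hz_values[i];
		int old_jiffies = hz / 5;	/* assumed I915_GEM_IDLE_TIMEOUT */
		int new_jiffies = hz / 2;	/* I915_GT_SUSPEND_IDLE_TIMEOUT */

		printf("HZ=%4d: %3d ms -> %3d ms\n", hz,
		       old_jiffies * 1000 / hz, new_jiffies * 1000 / hz);
	}

	return 0;	/* prints 200 ms -> 500 ms regardless of HZ */
}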

Comments

Matthew Auld Sept. 23, 2021, 9:18 a.m. UTC | #1
On 22/09/2021 07:25, Thomas Hellström wrote:
> With GuC submission on DG1, the execution of the requests times out
> for the gem_exec_suspend igt test case after executing around 800-900
> of 1000 submitted requests.
> 
> Given the time we allow elsewhere for fences to signal (in the order of
> seconds), increase the timeout before we mark the gt wedged and proceed.
> 
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Acked-by: Matthew Auld <matthew.auld@intel.com>

> ---
>   drivers/gpu/drm/i915/gt/intel_gt_pm.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> index dea8e2479897..f84f2bfe2de0 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> @@ -19,6 +19,8 @@
>   #include "intel_rps.h"
>   #include "intel_wakeref.h"
>   
> +#define I915_GT_SUSPEND_IDLE_TIMEOUT (HZ / 2)
> +
>   static void user_forcewake(struct intel_gt *gt, bool suspend)
>   {
>   	int count = atomic_read(&gt->user_wakeref);
> @@ -279,7 +281,7 @@ static void wait_for_suspend(struct intel_gt *gt)
>   	if (!intel_gt_pm_is_awake(gt))
>   		return;
>   
> -	if (intel_gt_wait_for_idle(gt, I915_GEM_IDLE_TIMEOUT) == -ETIME) {
> +	if (intel_gt_wait_for_idle(gt, I915_GT_SUSPEND_IDLE_TIMEOUT) == -ETIME) {
>   		/*
>   		 * Forcibly cancel outstanding work and leave
>   		 * the gpu quiet.
>
Tvrtko Ursulin Sept. 23, 2021, 10:13 a.m. UTC | #2
On 22/09/2021 07:25, Thomas Hellström wrote:
> With GuC submission on DG1, the execution of the requests times out
> for the gem_exec_suspend igt test case after executing around 800-900
> of 1000 submitted requests.
> 
> Given the time we allow elsewhere for fences to signal (in the order of
> seconds), increase the timeout before we mark the gt wedged and proceed.

I suspect it is not about requests not retiring in time but about the
intel_guc_wait_for_idle part of intel_gt_wait_for_idle. Although I don't
know which G2H message the code is waiting for at suspend time, so
perhaps something to run past the GuC experts.
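
For context, a simplified paraphrase of the wait in question (a sketch of
the call structure only, not the verbatim i915 code; the helper names and
signatures may differ from the actual tree):

/* Sketch only: the suspend-time idle wait has two parts. */
static int gt_wait_for_idle_sketch(struct intel_gt *gt, long timeout)
{
	if (!intel_gt_pm_is_awake(gt))
		return 0;

	/* Part 1: wait for submitted requests to complete and retire,
	 * consuming the timeout budget.
	 */
	timeout = intel_gt_retire_requests_timeout(gt, timeout);
	if (timeout < 0)
		return timeout;		/* e.g. -ETIME */

	/* Part 2: with GuC submission, additionally wait for outstanding
	 * G2H messages to be processed before declaring the GT idle.
	 */
	return intel_guc_wait_for_idle(&gt->uc.guc, timeout);
}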

Anyway, if that turns out to be correct then perhaps it would be better 
to split the two timeouts (like if required GuC timeout is perhaps 
fundamentally independent) so it's clear who needs how much time. Adding 
Matt and John to comment.

To be clear, as the timeout is AFAIK an arbitrary value, I don't have
fundamental objections here. I just think it would be good to have an
accurate story in the commit message.

Regards,

Tvrtko

> 
> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_gt_pm.c | 4 +++-
>   1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> index dea8e2479897..f84f2bfe2de0 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> @@ -19,6 +19,8 @@
>   #include "intel_rps.h"
>   #include "intel_wakeref.h"
>   
> +#define I915_GT_SUSPEND_IDLE_TIMEOUT (HZ / 2)
> +
>   static void user_forcewake(struct intel_gt *gt, bool suspend)
>   {
>   	int count = atomic_read(&gt->user_wakeref);
> @@ -279,7 +281,7 @@ static void wait_for_suspend(struct intel_gt *gt)
>   	if (!intel_gt_pm_is_awake(gt))
>   		return;
>   
> -	if (intel_gt_wait_for_idle(gt, I915_GEM_IDLE_TIMEOUT) == -ETIME) {
> +	if (intel_gt_wait_for_idle(gt, I915_GT_SUSPEND_IDLE_TIMEOUT) == -ETIME) {
>   		/*
>   		 * Forcibly cancel outstanding work and leave
>   		 * the gpu quiet.
>
Thomas Hellström Sept. 23, 2021, 11:47 a.m. UTC | #3
Hi, Tvrtko,

On 9/23/21 12:13 PM, Tvrtko Ursulin wrote:
>
> On 22/09/2021 07:25, Thomas Hellström wrote:
>> With GuC submission on DG1, the execution of the requests times out
>> for the gem_exec_suspend igt test case after executing around 800-900
>> of 1000 submitted requests.
>>
>> Given the time we allow elsewhere for fences to signal (in the order of
>> seconds), increase the timeout before we mark the gt wedged and proceed.
>
> I suspect it is not about requests not retiring in time but about the 
> intel_guc_wait_for_idle part of intel_gt_wait_for_idle. Although I 
> don't know which G2H message is the code waiting for at suspend time 
> so perhaps something to run past the GuC experts.

So what's happening here is that the test submits 1000 requests, each 
writing a value to an object, and then that object content is checked 
after resume. With GuC it turns out that only 800-900 or so values are 
actually written before we time out, and the test (basic-S3) fails, but 
not on every run.
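
Roughly, the failing pattern corresponds to the outline below. This is only
an illustration of the scenario described above, not the actual
gem_exec_suspend source; the helpers are hypothetical placeholders.

/* Illustrative outline; submit_store_dword(), suspend_and_resume_s3() and
 * check_object_dword() are hypothetical placeholders, not IGT functions.
 */
for (int i = 0; i < 1000; i++)
	submit_store_dword(fd, obj, i * sizeof(uint32_t), i);	/* request i writes value i */

suspend_and_resume_s3();	/* basic-S3: S3 cycle while requests may still be executing */

for (int i = 0; i < 1000; i++)
	check_object_dword(fd, obj, i * sizeof(uint32_t), i);	/* with GuC, fails past ~800-900 */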

This is a bit interesting in itself, because I never saw the hang-S3 
test fail, which from what I can tell is basically an identical test but 
with a spinner submitted after the 1000th request. Could be that the 
suspend backup code ends up waiting for something before we end up in 
intel_gt_wait_for_idle, giving more requests time to execute.

>
> Anyway, if that turns out to be correct then perhaps it would be 
> better to split the two timeouts (like if required GuC timeout is 
> perhaps fundamentally independent) so it's clear who needs how much 
> time. Adding Matt and John to comment.

You mean we have separate timeouts depending on whether we're using GuC 
or execlists submission?

>
> To be clear, as timeout is AFAIK an arbitrary value, I don't have 
> fundamental objections here. Just think it would be good to have 
> accurate story in the commit message.

OK, yes. I wonder whether we should actually increase this timeout even
more, since the watchdog now times out after 10+ seconds? I guess those
long-running requests could also be executing at suspend time.

/Thomas





>
> Regards,
>
> Tvrtko
>
>>
>> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_gt_pm.c | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c 
>> b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>> index dea8e2479897..f84f2bfe2de0 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>> @@ -19,6 +19,8 @@
>>   #include "intel_rps.h"
>>   #include "intel_wakeref.h"
>>   +#define I915_GT_SUSPEND_IDLE_TIMEOUT (HZ / 2)
>> +
>>   static void user_forcewake(struct intel_gt *gt, bool suspend)
>>   {
>>       int count = atomic_read(&gt->user_wakeref);
>> @@ -279,7 +281,7 @@ static void wait_for_suspend(struct intel_gt *gt)
>>       if (!intel_gt_pm_is_awake(gt))
>>           return;
>>   -    if (intel_gt_wait_for_idle(gt, I915_GEM_IDLE_TIMEOUT) == 
>> -ETIME) {
>> +    if (intel_gt_wait_for_idle(gt, I915_GT_SUSPEND_IDLE_TIMEOUT) == 
>> -ETIME) {
>>           /*
>>            * Forcibly cancel outstanding work and leave
>>            * the gpu quiet.
>>
Tvrtko Ursulin Sept. 23, 2021, 12:59 p.m. UTC | #4
On 23/09/2021 12:47, Thomas Hellström wrote:
> Hi, Tvrtko,
> 
> On 9/23/21 12:13 PM, Tvrtko Ursulin wrote:
>>
>> On 22/09/2021 07:25, Thomas Hellström wrote:
>>> With GuC submission on DG1, the execution of the requests times out
>>> for the gem_exec_suspend igt test case after executing around 800-900
>>> of 1000 submitted requests.
>>>
>>> Given the time we allow elsewhere for fences to signal (in the order of
>>> seconds), increase the timeout before we mark the gt wedged and proceed.
>>
>> I suspect it is not about requests not retiring in time but about the 
>> intel_guc_wait_for_idle part of intel_gt_wait_for_idle. Although I 
>> don't know which G2H message is the code waiting for at suspend time 
>> so perhaps something to run past the GuC experts.
> 
> So what's happening here is that the tests submits 1000 requests, each 
> writing a value to an object, and then that object content is checked 
> after resume. With GuC it turns out that only 800-900 or so values are 
> actually written before we time out, and the test (basic-S3) fails, but 
> not on every run.

Yes, and that did not make sense to me. It is a single context even, so I
did not come up with an explanation for why GuC would be slower.

Unless it somehow manages to not even update the ring tail in time and 
requests are still only stuck in the software queue? Perhaps you can see 
that from context tail and head when it happens.

> This is a bit interesting in itself, because I never saw the hang-S3 
> test fail, which from what I can tell basically is an identical test but 
> with a spinner submitted after the 1000th request. Could be that the 
> suspend backup code ends up waiting for something before we end up in 
> intel_gt_wait_for_idle, giving more requests time to execute.

No idea, I don't know the suspend paths that well. For instance before 
looking at the code I thought we would preempt what's executing and not 
wait for everything that has been submitted to finish. :)

>> Anyway, if that turns out to be correct then perhaps it would be 
>> better to split the two timeouts (like if required GuC timeout is 
>> perhaps fundamentally independent) so it's clear who needs how much 
>> time. Adding Matt and John to comment.
> 
> You mean we have separate timeouts depending on whether we're using GuC 
> or execlists submission?

No, I don't know yet. First I think we need to figure out what exactly 
is happening.

>> To be clear, as timeout is AFAIK an arbitrary value, I don't have 
>> fundamental objections here. Just think it would be good to have 
>> accurate story in the commit message.
> 
> Ok. yes. I wonder whether we actually should increase this timeout even 
> more since now the watchdog times out after 10+ seconds? I guess those 
> long-running requests could be executing also at suspend time.

We probably should not just increase it hugely. The watchdog is a
separate story, since it applies to unsubmitted and unready requests, and
suspend can happen fine with those around, I think.

Regards,

Tvrtko

> /Thomas
> 
> 
> 
> 
> 
>>
>> Regards,
>>
>> Tvrtko
>>
>>>
>>> Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/gt/intel_gt_pm.c | 4 +++-
>>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c 
>>> b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>>> index dea8e2479897..f84f2bfe2de0 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
>>> @@ -19,6 +19,8 @@
>>>   #include "intel_rps.h"
>>>   #include "intel_wakeref.h"
>>>   +#define I915_GT_SUSPEND_IDLE_TIMEOUT (HZ / 2)
>>> +
>>>   static void user_forcewake(struct intel_gt *gt, bool suspend)
>>>   {
>>>       int count = atomic_read(&gt->user_wakeref);
>>> @@ -279,7 +281,7 @@ static void wait_for_suspend(struct intel_gt *gt)
>>>       if (!intel_gt_pm_is_awake(gt))
>>>           return;
>>>   -    if (intel_gt_wait_for_idle(gt, I915_GEM_IDLE_TIMEOUT) == 
>>> -ETIME) {
>>> +    if (intel_gt_wait_for_idle(gt, I915_GT_SUSPEND_IDLE_TIMEOUT) == 
>>> -ETIME) {
>>>           /*
>>>            * Forcibly cancel outstanding work and leave
>>>            * the gpu quiet.
>>>
Thomas Hellström Sept. 23, 2021, 1:19 p.m. UTC | #5
On 9/23/21 2:59 PM, Tvrtko Ursulin wrote:
>
> On 23/09/2021 12:47, Thomas Hellström wrote:
>> Hi, Tvrtko,
>>
>> On 9/23/21 12:13 PM, Tvrtko Ursulin wrote:
>>>
>>> On 22/09/2021 07:25, Thomas Hellström wrote:
>>>> With GuC submission on DG1, the execution of the requests times out
>>>> for the gem_exec_suspend igt test case after executing around 800-900
>>>> of 1000 submitted requests.
>>>>
>>>> Given the time we allow elsewhere for fences to signal (in the 
>>>> order of
>>>> seconds), increase the timeout before we mark the gt wedged and 
>>>> proceed.
>>>
>>> I suspect it is not about requests not retiring in time but about 
>>> the intel_guc_wait_for_idle part of intel_gt_wait_for_idle. Although 
>>> I don't know which G2H message is the code waiting for at suspend 
>>> time so perhaps something to run past the GuC experts.
>>
>> So what's happening here is that the tests submits 1000 requests, 
>> each writing a value to an object, and then that object content is 
>> checked after resume. With GuC it turns out that only 800-900 or so 
>> values are actually written before we time out, and the test 
>> (basic-S3) fails, but not on every run.
>
> Yes and that did not make sense to me. It is a single context even so 
> I did not come up with an explanation why would GuC be slower.
>
> Unless it somehow manages to not even update the ring tail in time and 
> requests are still only stuck in the software queue? Perhaps you can 
> see that from context tail and head when it happens.
>
>> This is a bit interesting in itself, because I never saw the hang-S3 
>> test fail, which from what I can tell basically is an identical test 
>> but with a spinner submitted after the 1000th request. Could be that 
>> the suspend backup code ends up waiting for something before we end 
>> up in intel_gt_wait_for_idle, giving more requests time to execute.
>
> No idea, I don't know the suspend paths that well. For instance before 
> looking at the code I thought we would preempt what's executing and 
> not wait for everything that has been submitted to finish. :)
>
>>> Anyway, if that turns out to be correct then perhaps it would be 
>>> better to split the two timeouts (like if required GuC timeout is 
>>> perhaps fundamentally independent) so it's clear who needs how much 
>>> time. Adding Matt and John to comment.
>>
>> You mean we have separate timeouts depending on whether we're using 
>> GuC or execlists submission?
>
> No, I don't know yet. First I think we need to figure out what exactly 
> is happening.

Well then TBH I will need to file a separate Jira about that. There
might be various things going on here, like switching between the migrate
context for eviction of unrelated LMEM buffers and the context used by
gem_exec_suspend. The gem_exec_suspend failures are blocking DG1 BAT, so
it's pretty urgent to get this series merged. If you insist I can leave
this patch out for now, but I'd rather commit it as-is and file a Jira
instead.

/Thomas
Tvrtko Ursulin Sept. 23, 2021, 2:33 p.m. UTC | #6
On 23/09/2021 14:19, Thomas Hellström wrote:
> 
> On 9/23/21 2:59 PM, Tvrtko Ursulin wrote:
>>
>> On 23/09/2021 12:47, Thomas Hellström wrote:
>>> Hi, Tvrtko,
>>>
>>> On 9/23/21 12:13 PM, Tvrtko Ursulin wrote:
>>>>
>>>> On 22/09/2021 07:25, Thomas Hellström wrote:
>>>>> With GuC submission on DG1, the execution of the requests times out
>>>>> for the gem_exec_suspend igt test case after executing around 800-900
>>>>> of 1000 submitted requests.
>>>>>
>>>>> Given the time we allow elsewhere for fences to signal (in the 
>>>>> order of
>>>>> seconds), increase the timeout before we mark the gt wedged and 
>>>>> proceed.
>>>>
>>>> I suspect it is not about requests not retiring in time but about 
>>>> the intel_guc_wait_for_idle part of intel_gt_wait_for_idle. Although 
>>>> I don't know which G2H message is the code waiting for at suspend 
>>>> time so perhaps something to run past the GuC experts.
>>>
>>> So what's happening here is that the tests submits 1000 requests, 
>>> each writing a value to an object, and then that object content is 
>>> checked after resume. With GuC it turns out that only 800-900 or so 
>>> values are actually written before we time out, and the test 
>>> (basic-S3) fails, but not on every run.
>>
>> Yes and that did not make sense to me. It is a single context even so 
>> I did not come up with an explanation why would GuC be slower.
>>
>> Unless it somehow manages to not even update the ring tail in time and 
>> requests are still only stuck in the software queue? Perhaps you can 
>> see that from context tail and head when it happens.
>>
>>> This is a bit interesting in itself, because I never saw the hang-S3 
>>> test fail, which from what I can tell basically is an identical test 
>>> but with a spinner submitted after the 1000th request. Could be that 
>>> the suspend backup code ends up waiting for something before we end 
>>> up in intel_gt_wait_for_idle, giving more requests time to execute.
>>
>> No idea, I don't know the suspend paths that well. For instance before 
>> looking at the code I thought we would preempt what's executing and 
>> not wait for everything that has been submitted to finish. :)
>>
>>>> Anyway, if that turns out to be correct then perhaps it would be 
>>>> better to split the two timeouts (like if required GuC timeout is 
>>>> perhaps fundamentally independent) so it's clear who needs how much 
>>>> time. Adding Matt and John to comment.
>>>
>>> You mean we have separate timeouts depending on whether we're using 
>>> GuC or execlists submission?
>>
>> No, I don't know yet. First I think we need to figure out what exactly 
>> is happening.
> 
> Well then TBH I will need to file a separate Jira about that. There 
> might be various things going on here like swiching between the migrate 
> context for eviction of unrelated LMEM buffers and the context used by 
> gem_exec_suspend. The gem_exec_suspend failures are blocking DG1 BAT so 
> it's pretty urgent to get this series merged. If you insist I can leave 
> this patch out for now, but rather I'd commit it as is and File a Jira 
> instead.

I see now how you have i915_gem_suspend() in between two lmem_suspend()
calls in this series. So the first call has the potential of creating a lot
of requests, and you think that interferes? Sounds plausible, but it implies
GuC timeslicing is less efficient, if I follow?

IMO it is okay to leave for follow-up work, but strictly speaking, unless
I am missing something, the approach of bumping the timeout does not
sound valid if the copying is done async.

Because the timeout is then mandated not only as a function of GPU
activity (let's say user-controlled), but also of the amount of
unpinned/idle buffers which happen to be lying around (which is more
i915-controlled, or mixed at least).

So the question is: with enough data to copy, any timeout could be too low,
and then how long do we want to wait before failing suspend? Whether this
is an argument for a separate timeout specifically addressing the suspend
path, I am not sure. Perhaps there is no choice but to simply wait until
the buffers are swapped out, otherwise nothing will work.

Regards,

Tvrtko
Thomas Hellström Sept. 23, 2021, 3:43 p.m. UTC | #7
On 9/23/21 4:33 PM, Tvrtko Ursulin wrote:
>
> On 23/09/2021 14:19, Thomas Hellström wrote:
>>
>> On 9/23/21 2:59 PM, Tvrtko Ursulin wrote:
>>>
>>> On 23/09/2021 12:47, Thomas Hellström wrote:
>>>> Hi, Tvrtko,
>>>>
>>>> On 9/23/21 12:13 PM, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 22/09/2021 07:25, Thomas Hellström wrote:
>>>>>> With GuC submission on DG1, the execution of the requests times out
>>>>>> for the gem_exec_suspend igt test case after executing around 
>>>>>> 800-900
>>>>>> of 1000 submitted requests.
>>>>>>
>>>>>> Given the time we allow elsewhere for fences to signal (in the 
>>>>>> order of
>>>>>> seconds), increase the timeout before we mark the gt wedged and 
>>>>>> proceed.
>>>>>
>>>>> I suspect it is not about requests not retiring in time but about 
>>>>> the intel_guc_wait_for_idle part of intel_gt_wait_for_idle. 
>>>>> Although I don't know which G2H message is the code waiting for at 
>>>>> suspend time so perhaps something to run past the GuC experts.
>>>>
>>>> So what's happening here is that the tests submits 1000 requests, 
>>>> each writing a value to an object, and then that object content is 
>>>> checked after resume. With GuC it turns out that only 800-900 or so 
>>>> values are actually written before we time out, and the test 
>>>> (basic-S3) fails, but not on every run.
>>>
>>> Yes and that did not make sense to me. It is a single context even 
>>> so I did not come up with an explanation why would GuC be slower.
>>>
>>> Unless it somehow manages to not even update the ring tail in time 
>>> and requests are still only stuck in the software queue? Perhaps you 
>>> can see that from context tail and head when it happens.
>>>
>>>> This is a bit interesting in itself, because I never saw the 
>>>> hang-S3 test fail, which from what I can tell basically is an 
>>>> identical test but with a spinner submitted after the 1000th 
>>>> request. Could be that the suspend backup code ends up waiting for 
>>>> something before we end up in intel_gt_wait_for_idle, giving more 
>>>> requests time to execute.
>>>
>>> No idea, I don't know the suspend paths that well. For instance 
>>> before looking at the code I thought we would preempt what's 
>>> executing and not wait for everything that has been submitted to 
>>> finish. :)
>>>
>>>>> Anyway, if that turns out to be correct then perhaps it would be 
>>>>> better to split the two timeouts (like if required GuC timeout is 
>>>>> perhaps fundamentally independent) so it's clear who needs how 
>>>>> much time. Adding Matt and John to comment.
>>>>
>>>> You mean we have separate timeouts depending on whether we're using 
>>>> GuC or execlists submission?
>>>
>>> No, I don't know yet. First I think we need to figure out what 
>>> exactly is happening.
>>
>> Well then TBH I will need to file a separate Jira about that. There 
>> might be various things going on here like swiching between the 
>> migrate context for eviction of unrelated LMEM buffers and the 
>> context used by gem_exec_suspend. The gem_exec_suspend failures are 
>> blocking DG1 BAT so it's pretty urgent to get this series merged. If 
>> you insist I can leave this patch out for now, but rather I'd commit 
>> it as is and File a Jira instead.
>
> I see now how you have i915_gem_suspend() in between two 
> lmem_suspend() calls in this series. So first call has the potential 
> of creating a lot of requests and that you think interferes? Sounds 
> plausible but implies GuC timeslicing is less efficient if I follow?

Yes, I guess so. I'm not sure exactly what is not performing so well with
the GuC, but some tests really take a big performance hit, like
gem_lmem_swapping and gem_exec_whisper; those may trigger entirely
different situations than what we have here, though.

>
> IMO it is okay to leave for follow up work but strictly speaking, 
> unless I am missing something, the approach of bumping the timeout 
> does not sound valid if the copying is done async.

Not async ATM. In any case, it will probably make sense to sync before we 
start the GT timeout, so that remaining work can be done undisturbed by 
the copying. That way copying will always succeed, but depending on how 
much and what type of work user-space has queued up, it might be terminated.

>
> Because the timeout is then mandated not only as function of GPU 
> activity (lets say user controlled), but also the amount of 
> unpinned/idle buffers which happen to be laying around (which is more 
> i915 controlled, or mixed at least).
>
> So question is, with enough data to copy, any timeout could be too low 
> and then how long do we want to wait before failing suspend? Is this 
> an argument to have a separate timeout specifically addressing the 
> suspend path or not I am not sure. Perhaps there is no choice and 
> simply wait until buffers are swapped out otherwise nothing will work.
>
> Regards,
>
> Tvrtko

Thanks,

Thomas.
Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index dea8e2479897..f84f2bfe2de0 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -19,6 +19,8 @@ 
 #include "intel_rps.h"
 #include "intel_wakeref.h"
 
+#define I915_GT_SUSPEND_IDLE_TIMEOUT (HZ / 2)
+
 static void user_forcewake(struct intel_gt *gt, bool suspend)
 {
 	int count = atomic_read(&gt->user_wakeref);
@@ -279,7 +281,7 @@  static void wait_for_suspend(struct intel_gt *gt)
 	if (!intel_gt_pm_is_awake(gt))
 		return;
 
-	if (intel_gt_wait_for_idle(gt, I915_GEM_IDLE_TIMEOUT) == -ETIME) {
+	if (intel_gt_wait_for_idle(gt, I915_GT_SUSPEND_IDLE_TIMEOUT) == -ETIME) {
 		/*
 		 * Forcibly cancel outstanding work and leave
 		 * the gpu quiet.