diff mbox series

drm/i915: Don't wait forever in drop_caches

Message ID 20221101235053.1650364-1-John.C.Harrison@Intel.com (mailing list archive)
State New, archived
Headers show
Series drm/i915: Don't wait forever in drop_caches | expand

Commit Message

John Harrison Nov. 1, 2022, 11:50 p.m. UTC
From: John Harrison <John.C.Harrison@Intel.com>

At the end of each test, IGT does a drop caches call via sysfs with
special flags set. One of the possible paths waits for idle with an
infinite timeout. That causes problems for debugging issues when CI
catches a "can't go idle" test failure. Best case, the CI system times
out (after 90s), attempts a bunch of state dump actions and then
reboots the system to recover it. Worst case, the CI system can't do
anything at all and then times out (after 1000s) and simply reboots.
Sometimes a serial port log of dmesg might be available, sometimes not.

So rather than making life hard for ourselves, change the timeout to
be 10s rather than infinite. Also, trigger the standard
wedge/reset/recover sequence so that testing can continue with a
working system (if possible).

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Jani Nikula Nov. 2, 2022, 12:12 p.m. UTC | #1
On Tue, 01 Nov 2022, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> At the end of each test, IGT does a drop caches call via sysfs with

sysfs?

> special flags set. One of the possible paths waits for idle with an
> infinite timeout. That causes problems for debugging issues when CI
> catches a "can't go idle" test failure. Best case, the CI system times
> out (after 90s), attempts a bunch of state dump actions and then
> reboots the system to recover it. Worst case, the CI system can't do
> anything at all and then times out (after 1000s) and simply reboots.
> Sometimes a serial port log of dmesg might be available, sometimes not.
>
> So rather than making life hard for ourselves, change the timeout to
> be 10s rather than infinite. Also, trigger the standard
> wedge/reset/recover sequence so that testing can continue with a
> working system (if possible).
>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>  drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index ae987e92251dd..9d916fbbfc27c 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -641,6 +641,9 @@ DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>  		  DROP_RESET_ACTIVE | \
>  		  DROP_RESET_SEQNO | \
>  		  DROP_RCU)
> +
> +#define DROP_IDLE_TIMEOUT	(HZ * 10)

I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also only used
here.

I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in
gt/intel_gt.c.

I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in intel_gt_pm.c.

I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.

My head spins.


BR,
Jani.


> +
>  static int
>  i915_drop_caches_get(void *data, u64 *val)
>  {
> @@ -661,7 +664,9 @@ gt_drop_caches(struct intel_gt *gt, u64 val)
>  		intel_gt_retire_requests(gt);
>  
>  	if (val & (DROP_IDLE | DROP_ACTIVE)) {
> -		ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
> +		ret = intel_gt_wait_for_idle(gt, DROP_IDLE_TIMEOUT);
> +		if (ret == -ETIME)
> +			intel_gt_set_wedged(gt);
>  		if (ret)
>  			return ret;
>  	}
Tvrtko Ursulin Nov. 2, 2022, 2:20 p.m. UTC | #2
On 02/11/2022 12:12, Jani Nikula wrote:
> On Tue, 01 Nov 2022, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> At the end of each test, IGT does a drop caches call via sysfs with
> 
> sysfs?
> 
>> special flags set. One of the possible paths waits for idle with an
>> infinite timeout. That causes problems for debugging issues when CI
>> catches a "can't go idle" test failure. Best case, the CI system times
>> out (after 90s), attempts a bunch of state dump actions and then
>> reboots the system to recover it. Worst case, the CI system can't do
>> anything at all and then times out (after 1000s) and simply reboots.
>> Sometimes a serial port log of dmesg might be available, sometimes not.
>>
>> So rather than making life hard for ourselves, change the timeout to
>> be 10s rather than infinite. Also, trigger the standard
>> wedge/reset/recover sequence so that testing can continue with a
>> working system (if possible).
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
>> index ae987e92251dd..9d916fbbfc27c 100644
>> --- a/drivers/gpu/drm/i915/i915_debugfs.c
>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
>> @@ -641,6 +641,9 @@ DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>>   		  DROP_RESET_ACTIVE | \
>>   		  DROP_RESET_SEQNO | \
>>   		  DROP_RCU)
>> +
>> +#define DROP_IDLE_TIMEOUT	(HZ * 10)
> 
> I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also only used
> here.

So move here, dropping i915 prefix, next to the newly proposed one?

> I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in
> gt/intel_gt.c.

Move there and rename to GT_IDLE_TIMEOUT?

> I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in intel_gt_pm.c.

No action needed, maybe drop i915 prefix if wanted.

> I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.

Add _MS suffix if wanted.

> My head spins.

I follow and raise that the newly proposed DROP_IDLE_TIMEOUT applies to 
DROP_ACTIVE and not only DROP_IDLE.

Things get refactored, code moves around, bits get left behind, who 
knows. No reason to get too worked up. :) As long as people are taking a 
wider view when touching the code base, and are not afraid to send 
cleanups, things should be good.

For the actual functional change at hand - it would be nice if code 
paths in question could handle SIGINT and then we could punt the 
decision on how long someone wants to wait purely to userspace. But it's 
probably hard and it's only debugfs so whatever.

Whether or not 10s is enough CI will hopefully tell us. I'd probably err 
on the side of safety and make it longer, but at most half from the test 
runner timeout.

I am not convinced that wedging is correct though. Conceptually could be 
just that the timeout is too short. What does wedging really give us, on 
top of limiting the wait, when latter AFAIU is the key factor which 
would prevent the need to reboot the machine?

Regards,

Tvrtko
John Harrison Nov. 3, 2022, 1:33 a.m. UTC | #3
On 11/2/2022 07:20, Tvrtko Ursulin wrote:
> On 02/11/2022 12:12, Jani Nikula wrote:
>> On Tue, 01 Nov 2022, John.C.Harrison@Intel.com wrote:
>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>
>>> At the end of each test, IGT does a drop caches call via sysfs with
>>
>> sysfs?
Sorry, that was meant to say debugfs. I've also been working on some 
sysfs IGT issues and evidently got my wires crossed!

>>
>>> special flags set. One of the possible paths waits for idle with an
>>> infinite timeout. That causes problems for debugging issues when CI
>>> catches a "can't go idle" test failure. Best case, the CI system times
>>> out (after 90s), attempts a bunch of state dump actions and then
>>> reboots the system to recover it. Worst case, the CI system can't do
>>> anything at all and then times out (after 1000s) and simply reboots.
>>> Sometimes a serial port log of dmesg might be available, sometimes not.
>>>
>>> So rather than making life hard for ourselves, change the timeout to
>>> be 10s rather than infinite. Also, trigger the standard
>>> wedge/reset/recover sequence so that testing can continue with a
>>> working system (if possible).
>>>
>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
>>> b/drivers/gpu/drm/i915/i915_debugfs.c
>>> index ae987e92251dd..9d916fbbfc27c 100644
>>> --- a/drivers/gpu/drm/i915/i915_debugfs.c
>>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
>>> @@ -641,6 +641,9 @@ DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>>>             DROP_RESET_ACTIVE | \
>>>             DROP_RESET_SEQNO | \
>>>             DROP_RCU)
>>> +
>>> +#define DROP_IDLE_TIMEOUT    (HZ * 10)
>>
>> I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also only used
>> here.
>
> So move here, dropping i915 prefix, next to the newly proposed one?
Sure, can do that.

>
>> I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in
>> gt/intel_gt.c.
>
> Move there and rename to GT_IDLE_TIMEOUT?
>
>> I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in intel_gt_pm.c.
>
> No action needed, maybe drop i915 prefix if wanted.
>
These two are totally unrelated and in code not being touched by this 
change. I would rather not conflate changing random other things with 
fixing this specific issue.

>> I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.
>
> Add _MS suffix if wanted.
>
>> My head spins.
>
> I follow and raise that the newly proposed DROP_IDLE_TIMEOUT applies 
> to DROP_ACTIVE and not only DROP_IDLE.
My original intention for the name was that is the 'drop caches timeout 
for intel_gt_wait_for_idle'. Which is quite the mouthful and hence 
abbreviated to DROP_IDLE_TIMEOUT. But yes, I realised later that name 
can be conflated with the DROP_IDLE flag. Will rename.


>
> Things get refactored, code moves around, bits get left behind, who 
> knows. No reason to get too worked up. :) As long as people are taking 
> a wider view when touching the code base, and are not afraid to send 
> cleanups, things should be good.
On the other hand, if every patch gets blocked in code review because 
someone points out some completely unrelated piece of code could be a 
bit better then nothing ever gets fixed. If you spot something that you 
think should be improved, isn't the general idea that you should post a 
patch yourself to improve it?


>
> For the actual functional change at hand - it would be nice if code 
> paths in question could handle SIGINT and then we could punt the 
> decision on how long someone wants to wait purely to userspace. But 
> it's probably hard and it's only debugfs so whatever.
>
The code paths in question will already abort on a signal won't they? 
Both intel_gt_wait_for_idle() and intel_guc_wait_for_pending_msg(), 
which is where the uc_wait_for_idle eventually ends up, have an 
'if(signal_pending) return -EINTR;' check. Beyond that, it sounds like 
what you are asking for is a change in the IGT libraries and/or CI 
framework to start sending signals after some specific timeout. That 
seems like a significantly more complex change (in terms of the number 
of entities affected and number of groups involved) and unnecessary.

> Whether or not 10s is enough CI will hopefully tell us. I'd probably 
> err on the side of safety and make it longer, but at most half from 
> the test runner timeout.
This is supposed to be test clean up. This is not about how long a 
particular test takes to complete but about how long it takes to declare 
the system broken after the test has already finished. I would argue 
that even 10s is massively longer than required.

>
> I am not convinced that wedging is correct though. Conceptually could 
> be just that the timeout is too short. What does wedging really give 
> us, on top of limiting the wait, when latter AFAIU is the key factor 
> which would prevent the need to reboot the machine?
>
It gives us a system that knows what state it is in. If we can't idle 
the GT then something is very badly wrong. Wedging indicates that. It 
also ensure that a full GT reset will be attempted before the next test 
is run. Helping to prevent a failure on test X from propagating into 
failures of unrelated tests X+1, X+2, ... And if the GT reset does not 
work because the system is really that badly broken then future tests 
will not run rather than report erroneous failures.

This is not about getting a more stable system for end users by sweeping 
issues under the carpet and pretending all is well. End users don't run 
IGTs or explicitly call dodgy debugfs entry points. The sole motivation 
here is to get more accurate results from CI. That is, correctly 
identifying which test has hit a problem, getting valid debug analysis 
for that test (logs and such) and allowing further testing to complete 
correctly in the case where the system can be recovered.

John.

> Regards,
>
> Tvrtko
Tvrtko Ursulin Nov. 3, 2022, 9:18 a.m. UTC | #4
On 03/11/2022 01:33, John Harrison wrote:
> On 11/2/2022 07:20, Tvrtko Ursulin wrote:
>> On 02/11/2022 12:12, Jani Nikula wrote:
>>> On Tue, 01 Nov 2022, John.C.Harrison@Intel.com wrote:
>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>
>>>> At the end of each test, IGT does a drop caches call via sysfs with
>>>
>>> sysfs?
> Sorry, that was meant to say debugfs. I've also been working on some 
> sysfs IGT issues and evidently got my wires crossed!
> 
>>>
>>>> special flags set. One of the possible paths waits for idle with an
>>>> infinite timeout. That causes problems for debugging issues when CI
>>>> catches a "can't go idle" test failure. Best case, the CI system times
>>>> out (after 90s), attempts a bunch of state dump actions and then
>>>> reboots the system to recover it. Worst case, the CI system can't do
>>>> anything at all and then times out (after 1000s) and simply reboots.
>>>> Sometimes a serial port log of dmesg might be available, sometimes not.
>>>>
>>>> So rather than making life hard for ourselves, change the timeout to
>>>> be 10s rather than infinite. Also, trigger the standard
>>>> wedge/reset/recover sequence so that testing can continue with a
>>>> working system (if possible).
>>>>
>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>> ---
>>>>   drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
>>>> b/drivers/gpu/drm/i915/i915_debugfs.c
>>>> index ae987e92251dd..9d916fbbfc27c 100644
>>>> --- a/drivers/gpu/drm/i915/i915_debugfs.c
>>>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
>>>> @@ -641,6 +641,9 @@ DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>>>>             DROP_RESET_ACTIVE | \
>>>>             DROP_RESET_SEQNO | \
>>>>             DROP_RCU)
>>>> +
>>>> +#define DROP_IDLE_TIMEOUT    (HZ * 10)
>>>
>>> I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also only used
>>> here.
>>
>> So move here, dropping i915 prefix, next to the newly proposed one?
> Sure, can do that.
> 
>>
>>> I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in
>>> gt/intel_gt.c.
>>
>> Move there and rename to GT_IDLE_TIMEOUT?
>>
>>> I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in intel_gt_pm.c.
>>
>> No action needed, maybe drop i915 prefix if wanted.
>>
> These two are totally unrelated and in code not being touched by this 
> change. I would rather not conflate changing random other things with 
> fixing this specific issue.
> 
>>> I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.
>>
>> Add _MS suffix if wanted.
>>
>>> My head spins.
>>
>> I follow and raise that the newly proposed DROP_IDLE_TIMEOUT applies 
>> to DROP_ACTIVE and not only DROP_IDLE.
> My original intention for the name was that is the 'drop caches timeout 
> for intel_gt_wait_for_idle'. Which is quite the mouthful and hence 
> abbreviated to DROP_IDLE_TIMEOUT. But yes, I realised later that name 
> can be conflated with the DROP_IDLE flag. Will rename.
> 
> 
>>
>> Things get refactored, code moves around, bits get left behind, who 
>> knows. No reason to get too worked up. :) As long as people are taking 
>> a wider view when touching the code base, and are not afraid to send 
>> cleanups, things should be good.
> On the other hand, if every patch gets blocked in code review because 
> someone points out some completely unrelated piece of code could be a 
> bit better then nothing ever gets fixed. If you spot something that you 
> think should be improved, isn't the general idea that you should post a 
> patch yourself to improve it?

There's two maintainers per branch and an order of magnitude or two more 
developers so it'd be nice if cleanups would just be incoming on 
self-initiative basis. ;)

>> For the actual functional change at hand - it would be nice if code 
>> paths in question could handle SIGINT and then we could punt the 
>> decision on how long someone wants to wait purely to userspace. But 
>> it's probably hard and it's only debugfs so whatever.
>>
> The code paths in question will already abort on a signal won't they? 
> Both intel_gt_wait_for_idle() and intel_guc_wait_for_pending_msg(), 
> which is where the uc_wait_for_idle eventually ends up, have an 
> 'if(signal_pending) return -EINTR;' check. Beyond that, it sounds like 
> what you are asking for is a change in the IGT libraries and/or CI 
> framework to start sending signals after some specific timeout. That 
> seems like a significantly more complex change (in terms of the number 
> of entities affected and number of groups involved) and unnecessary.

If you say so, I haven't looked at them all. But if the code path in 
question already aborts on signals then I am not sure what is the patch 
fixing? I assumed you are trying to avoid the write stuck in D forever, 
which then prevents driver unload and everything, requiring the test 
runner to eventually reboot. If you say SIGINT works then you can 
already recover from userspace, no?

>> Whether or not 10s is enough CI will hopefully tell us. I'd probably 
>> err on the side of safety and make it longer, but at most half from 
>> the test runner timeout.
> This is supposed to be test clean up. This is not about how long a 
> particular test takes to complete but about how long it takes to declare 
> the system broken after the test has already finished. I would argue 
> that even 10s is massively longer than required.
> 
>>
>> I am not convinced that wedging is correct though. Conceptually could 
>> be just that the timeout is too short. What does wedging really give 
>> us, on top of limiting the wait, when latter AFAIU is the key factor 
>> which would prevent the need to reboot the machine?
>>
> It gives us a system that knows what state it is in. If we can't idle 
> the GT then something is very badly wrong. Wedging indicates that. It 
> also ensure that a full GT reset will be attempted before the next test 
> is run. Helping to prevent a failure on test X from propagating into 
> failures of unrelated tests X+1, X+2, ... And if the GT reset does not 
> work because the system is really that badly broken then future tests 
> will not run rather than report erroneous failures.
> 
> This is not about getting a more stable system for end users by sweeping 
> issues under the carpet and pretending all is well. End users don't run 
> IGTs or explicitly call dodgy debugfs entry points. The sole motivation 
> here is to get more accurate results from CI. That is, correctly 
> identifying which test has hit a problem, getting valid debug analysis 
> for that test (logs and such) and allowing further testing to complete 
> correctly in the case where the system can be recovered.

I don't really oppose shortening of the timeout in principle, just want 
a clear statement if this is something IGT / test runner could already 
do or not. It can apply a timeout, it can also send SIGINT, and it could 
even trigger a reset from outside. Sure it is debugfs hacks so general 
"kernel should not implement policy" need not be strictly followed, but 
lets have it clear what are the options.

Regards,

Tvrtko
Tvrtko Ursulin Nov. 3, 2022, 9:38 a.m. UTC | #5
On 03/11/2022 09:18, Tvrtko Ursulin wrote:
> 
> On 03/11/2022 01:33, John Harrison wrote:
>> On 11/2/2022 07:20, Tvrtko Ursulin wrote:
>>> On 02/11/2022 12:12, Jani Nikula wrote:
>>>> On Tue, 01 Nov 2022, John.C.Harrison@Intel.com wrote:
>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>
>>>>> At the end of each test, IGT does a drop caches call via sysfs with
>>>>
>>>> sysfs?
>> Sorry, that was meant to say debugfs. I've also been working on some 
>> sysfs IGT issues and evidently got my wires crossed!
>>
>>>>
>>>>> special flags set. One of the possible paths waits for idle with an
>>>>> infinite timeout. That causes problems for debugging issues when CI
>>>>> catches a "can't go idle" test failure. Best case, the CI system times
>>>>> out (after 90s), attempts a bunch of state dump actions and then
>>>>> reboots the system to recover it. Worst case, the CI system can't do
>>>>> anything at all and then times out (after 1000s) and simply reboots.
>>>>> Sometimes a serial port log of dmesg might be available, sometimes 
>>>>> not.
>>>>>
>>>>> So rather than making life hard for ourselves, change the timeout to
>>>>> be 10s rather than infinite. Also, trigger the standard
>>>>> wedge/reset/recover sequence so that testing can continue with a
>>>>> working system (if possible).
>>>>>
>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>> ---
>>>>>   drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>>>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
>>>>> b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>> index ae987e92251dd..9d916fbbfc27c 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_debugfs.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>> @@ -641,6 +641,9 @@ DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>>>>>             DROP_RESET_ACTIVE | \
>>>>>             DROP_RESET_SEQNO | \
>>>>>             DROP_RCU)
>>>>> +
>>>>> +#define DROP_IDLE_TIMEOUT    (HZ * 10)
>>>>
>>>> I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also only used
>>>> here.
>>>
>>> So move here, dropping i915 prefix, next to the newly proposed one?
>> Sure, can do that.
>>
>>>
>>>> I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in
>>>> gt/intel_gt.c.
>>>
>>> Move there and rename to GT_IDLE_TIMEOUT?
>>>
>>>> I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in intel_gt_pm.c.
>>>
>>> No action needed, maybe drop i915 prefix if wanted.
>>>
>> These two are totally unrelated and in code not being touched by this 
>> change. I would rather not conflate changing random other things with 
>> fixing this specific issue.
>>
>>>> I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.
>>>
>>> Add _MS suffix if wanted.
>>>
>>>> My head spins.
>>>
>>> I follow and raise that the newly proposed DROP_IDLE_TIMEOUT applies 
>>> to DROP_ACTIVE and not only DROP_IDLE.
>> My original intention for the name was that is the 'drop caches 
>> timeout for intel_gt_wait_for_idle'. Which is quite the mouthful and 
>> hence abbreviated to DROP_IDLE_TIMEOUT. But yes, I realised later that 
>> name can be conflated with the DROP_IDLE flag. Will rename.
>>
>>
>>>
>>> Things get refactored, code moves around, bits get left behind, who 
>>> knows. No reason to get too worked up. :) As long as people are 
>>> taking a wider view when touching the code base, and are not afraid 
>>> to send cleanups, things should be good.
>> On the other hand, if every patch gets blocked in code review because 
>> someone points out some completely unrelated piece of code could be a 
>> bit better then nothing ever gets fixed. If you spot something that 
>> you think should be improved, isn't the general idea that you should 
>> post a patch yourself to improve it?
> 
> There's two maintainers per branch and an order of magnitude or two more 
> developers so it'd be nice if cleanups would just be incoming on 
> self-initiative basis. ;)
> 
>>> For the actual functional change at hand - it would be nice if code 
>>> paths in question could handle SIGINT and then we could punt the 
>>> decision on how long someone wants to wait purely to userspace. But 
>>> it's probably hard and it's only debugfs so whatever.
>>>
>> The code paths in question will already abort on a signal won't they? 
>> Both intel_gt_wait_for_idle() and intel_guc_wait_for_pending_msg(), 
>> which is where the uc_wait_for_idle eventually ends up, have an 
>> 'if(signal_pending) return -EINTR;' check. Beyond that, it sounds like 
>> what you are asking for is a change in the IGT libraries and/or CI 
>> framework to start sending signals after some specific timeout. That 
>> seems like a significantly more complex change (in terms of the number 
>> of entities affected and number of groups involved) and unnecessary.
> 
> If you say so, I haven't looked at them all. But if the code path in 
> question already aborts on signals then I am not sure what is the patch 
> fixing? I assumed you are trying to avoid the write stuck in D forever, 
> which then prevents driver unload and everything, requiring the test 
> runner to eventually reboot. If you say SIGINT works then you can 
> already recover from userspace, no?
> 
>>> Whether or not 10s is enough CI will hopefully tell us. I'd probably 
>>> err on the side of safety and make it longer, but at most half from 
>>> the test runner timeout.
>> This is supposed to be test clean up. This is not about how long a 
>> particular test takes to complete but about how long it takes to 
>> declare the system broken after the test has already finished. I would 
>> argue that even 10s is massively longer than required.
>>
>>>
>>> I am not convinced that wedging is correct though. Conceptually could 
>>> be just that the timeout is too short. What does wedging really give 
>>> us, on top of limiting the wait, when latter AFAIU is the key factor 
>>> which would prevent the need to reboot the machine?
>>>
>> It gives us a system that knows what state it is in. If we can't idle 
>> the GT then something is very badly wrong. Wedging indicates that. It 
>> also ensure that a full GT reset will be attempted before the next 
>> test is run. Helping to prevent a failure on test X from propagating 
>> into failures of unrelated tests X+1, X+2, ... And if the GT reset 
>> does not work because the system is really that badly broken then 
>> future tests will not run rather than report erroneous failures.
>>
>> This is not about getting a more stable system for end users by 
>> sweeping issues under the carpet and pretending all is well. End users 
>> don't run IGTs or explicitly call dodgy debugfs entry points. The sole 
>> motivation here is to get more accurate results from CI. That is, 
>> correctly identifying which test has hit a problem, getting valid 
>> debug analysis for that test (logs and such) and allowing further 
>> testing to complete correctly in the case where the system can be 
>> recovered.
> 
> I don't really oppose shortening of the timeout in principle, just want 
> a clear statement if this is something IGT / test runner could already 
> do or not. It can apply a timeout, it can also send SIGINT, and it could 
> even trigger a reset from outside. Sure it is debugfs hacks so general 
> "kernel should not implement policy" need not be strictly followed, but 
> lets have it clear what are the options.

One conceptual problem with applying this policy is that the code is:

	if (val & (DROP_IDLE | DROP_ACTIVE)) {
		ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
		if (ret)
			return ret;
	}

	if (val & DROP_IDLE) {
		ret = intel_gt_pm_wait_for_idle(gt);
		if (ret)
			return ret;
	}

So if someone passes in DROP_IDLE and then why would only the first 
branch have a short timeout and wedge. Yeah some bug happens to be there 
at the moment, but put a bug in a different place and you hang on the 
second branch and then need another patch. Versus perhaps making it all 
respect SIGINT and handle from outside.

Regards,

Tvrtko
Jani Nikula Nov. 3, 2022, 10:45 a.m. UTC | #6
On Wed, 02 Nov 2022, John Harrison <john.c.harrison@intel.com> wrote:
> On 11/2/2022 07:20, Tvrtko Ursulin wrote:
>> On 02/11/2022 12:12, Jani Nikula wrote:
>>> On Tue, 01 Nov 2022, John.C.Harrison@Intel.com wrote:
>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>
>>>> At the end of each test, IGT does a drop caches call via sysfs with
>>>
>>> sysfs?
> Sorry, that was meant to say debugfs. I've also been working on some 
> sysfs IGT issues and evidently got my wires crossed!
>
>>>
>>>> special flags set. One of the possible paths waits for idle with an
>>>> infinite timeout. That causes problems for debugging issues when CI
>>>> catches a "can't go idle" test failure. Best case, the CI system times
>>>> out (after 90s), attempts a bunch of state dump actions and then
>>>> reboots the system to recover it. Worst case, the CI system can't do
>>>> anything at all and then times out (after 1000s) and simply reboots.
>>>> Sometimes a serial port log of dmesg might be available, sometimes not.
>>>>
>>>> So rather than making life hard for ourselves, change the timeout to
>>>> be 10s rather than infinite. Also, trigger the standard
>>>> wedge/reset/recover sequence so that testing can continue with a
>>>> working system (if possible).
>>>>
>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>> ---
>>>>   drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
>>>> b/drivers/gpu/drm/i915/i915_debugfs.c
>>>> index ae987e92251dd..9d916fbbfc27c 100644
>>>> --- a/drivers/gpu/drm/i915/i915_debugfs.c
>>>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
>>>> @@ -641,6 +641,9 @@ DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>>>>             DROP_RESET_ACTIVE | \
>>>>             DROP_RESET_SEQNO | \
>>>>             DROP_RCU)
>>>> +
>>>> +#define DROP_IDLE_TIMEOUT    (HZ * 10)
>>>
>>> I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also only used
>>> here.
>>
>> So move here, dropping i915 prefix, next to the newly proposed one?
> Sure, can do that.
>
>>
>>> I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in
>>> gt/intel_gt.c.
>>
>> Move there and rename to GT_IDLE_TIMEOUT?
>>
>>> I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in intel_gt_pm.c.
>>
>> No action needed, maybe drop i915 prefix if wanted.
>>
> These two are totally unrelated and in code not being touched by this 
> change. I would rather not conflate changing random other things with 
> fixing this specific issue.
>
>>> I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.
>>
>> Add _MS suffix if wanted.
>>
>>> My head spins.
>>
>> I follow and raise that the newly proposed DROP_IDLE_TIMEOUT applies 
>> to DROP_ACTIVE and not only DROP_IDLE.
> My original intention for the name was that is the 'drop caches timeout 
> for intel_gt_wait_for_idle'. Which is quite the mouthful and hence 
> abbreviated to DROP_IDLE_TIMEOUT. But yes, I realised later that name 
> can be conflated with the DROP_IDLE flag. Will rename.
>
>
>>
>> Things get refactored, code moves around, bits get left behind, who 
>> knows. No reason to get too worked up. :) As long as people are taking 
>> a wider view when touching the code base, and are not afraid to send 
>> cleanups, things should be good.
> On the other hand, if every patch gets blocked in code review because 
> someone points out some completely unrelated piece of code could be a 
> bit better then nothing ever gets fixed. If you spot something that you 
> think should be improved, isn't the general idea that you should post a 
> patch yourself to improve it?

The general idea is that every change should improve the driver. If you
need to modify something that's a mess, you fix the mess instead of
adding to the mess. You can't put the onus on cleaning up after you on
someone else.

Sure, the patch at hand is neglible, but hey, so are the fixes.

BR,
Jani.


>
>
>>
>> For the actual functional change at hand - it would be nice if code 
>> paths in question could handle SIGINT and then we could punt the 
>> decision on how long someone wants to wait purely to userspace. But 
>> it's probably hard and it's only debugfs so whatever.
>>
> The code paths in question will already abort on a signal won't they? 
> Both intel_gt_wait_for_idle() and intel_guc_wait_for_pending_msg(), 
> which is where the uc_wait_for_idle eventually ends up, have an 
> 'if(signal_pending) return -EINTR;' check. Beyond that, it sounds like 
> what you are asking for is a change in the IGT libraries and/or CI 
> framework to start sending signals after some specific timeout. That 
> seems like a significantly more complex change (in terms of the number 
> of entities affected and number of groups involved) and unnecessary.
>
>> Whether or not 10s is enough CI will hopefully tell us. I'd probably 
>> err on the side of safety and make it longer, but at most half from 
>> the test runner timeout.
> This is supposed to be test clean up. This is not about how long a 
> particular test takes to complete but about how long it takes to declare 
> the system broken after the test has already finished. I would argue 
> that even 10s is massively longer than required.
>
>>
>> I am not convinced that wedging is correct though. Conceptually could 
>> be just that the timeout is too short. What does wedging really give 
>> us, on top of limiting the wait, when latter AFAIU is the key factor 
>> which would prevent the need to reboot the machine?
>>
> It gives us a system that knows what state it is in. If we can't idle 
> the GT then something is very badly wrong. Wedging indicates that. It 
> also ensure that a full GT reset will be attempted before the next test 
> is run. Helping to prevent a failure on test X from propagating into 
> failures of unrelated tests X+1, X+2, ... And if the GT reset does not 
> work because the system is really that badly broken then future tests 
> will not run rather than report erroneous failures.
>
> This is not about getting a more stable system for end users by sweeping 
> issues under the carpet and pretending all is well. End users don't run 
> IGTs or explicitly call dodgy debugfs entry points. The sole motivation 
> here is to get more accurate results from CI. That is, correctly 
> identifying which test has hit a problem, getting valid debug analysis 
> for that test (logs and such) and allowing further testing to complete 
> correctly in the case where the system can be recovered.
>
> John.
>
>> Regards,
>>
>> Tvrtko
>
John Harrison Nov. 3, 2022, 7:16 p.m. UTC | #7
On 11/3/2022 02:38, Tvrtko Ursulin wrote:
> On 03/11/2022 09:18, Tvrtko Ursulin wrote:
>> On 03/11/2022 01:33, John Harrison wrote:
>>> On 11/2/2022 07:20, Tvrtko Ursulin wrote:
>>>> On 02/11/2022 12:12, Jani Nikula wrote:
>>>>> On Tue, 01 Nov 2022, John.C.Harrison@Intel.com wrote:
>>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>>
>>>>>> At the end of each test, IGT does a drop caches call via sysfs with
>>>>>
>>>>> sysfs?
>>> Sorry, that was meant to say debugfs. I've also been working on some 
>>> sysfs IGT issues and evidently got my wires crossed!
>>>
>>>>>
>>>>>> special flags set. One of the possible paths waits for idle with an
>>>>>> infinite timeout. That causes problems for debugging issues when CI
>>>>>> catches a "can't go idle" test failure. Best case, the CI system 
>>>>>> times
>>>>>> out (after 90s), attempts a bunch of state dump actions and then
>>>>>> reboots the system to recover it. Worst case, the CI system can't do
>>>>>> anything at all and then times out (after 1000s) and simply reboots.
>>>>>> Sometimes a serial port log of dmesg might be available, 
>>>>>> sometimes not.
>>>>>>
>>>>>> So rather than making life hard for ourselves, change the timeout to
>>>>>> be 10s rather than infinite. Also, trigger the standard
>>>>>> wedge/reset/recover sequence so that testing can continue with a
>>>>>> working system (if possible).
>>>>>>
>>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>>>>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
>>>>>> b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>> index ae987e92251dd..9d916fbbfc27c 100644
>>>>>> --- a/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>> @@ -641,6 +641,9 @@ 
>>>>>> DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>>>>>>             DROP_RESET_ACTIVE | \
>>>>>>             DROP_RESET_SEQNO | \
>>>>>>             DROP_RCU)
>>>>>> +
>>>>>> +#define DROP_IDLE_TIMEOUT    (HZ * 10)
>>>>>
>>>>> I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also only 
>>>>> used
>>>>> here.
>>>>
>>>> So move here, dropping i915 prefix, next to the newly proposed one?
>>> Sure, can do that.
>>>
>>>>
>>>>> I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in
>>>>> gt/intel_gt.c.
>>>>
>>>> Move there and rename to GT_IDLE_TIMEOUT?
>>>>
>>>>> I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in 
>>>>> intel_gt_pm.c.
>>>>
>>>> No action needed, maybe drop i915 prefix if wanted.
>>>>
>>> These two are totally unrelated and in code not being touched by 
>>> this change. I would rather not conflate changing random other 
>>> things with fixing this specific issue.
>>>
>>>>> I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.
>>>>
>>>> Add _MS suffix if wanted.
>>>>
>>>>> My head spins.
>>>>
>>>> I follow and raise that the newly proposed DROP_IDLE_TIMEOUT 
>>>> applies to DROP_ACTIVE and not only DROP_IDLE.
>>> My original intention for the name was that is the 'drop caches 
>>> timeout for intel_gt_wait_for_idle'. Which is quite the mouthful and 
>>> hence abbreviated to DROP_IDLE_TIMEOUT. But yes, I realised later 
>>> that name can be conflated with the DROP_IDLE flag. Will rename.
>>>
>>>
>>>>
>>>> Things get refactored, code moves around, bits get left behind, who 
>>>> knows. No reason to get too worked up. :) As long as people are 
>>>> taking a wider view when touching the code base, and are not afraid 
>>>> to send cleanups, things should be good.
>>> On the other hand, if every patch gets blocked in code review 
>>> because someone points out some completely unrelated piece of code 
>>> could be a bit better then nothing ever gets fixed. If you spot 
>>> something that you think should be improved, isn't the general idea 
>>> that you should post a patch yourself to improve it?
>>
>> There's two maintainers per branch and an order of magnitude or two 
>> more developers so it'd be nice if cleanups would just be incoming on 
>> self-initiative basis. ;)
>>
>>>> For the actual functional change at hand - it would be nice if code 
>>>> paths in question could handle SIGINT and then we could punt the 
>>>> decision on how long someone wants to wait purely to userspace. But 
>>>> it's probably hard and it's only debugfs so whatever.
>>>>
>>> The code paths in question will already abort on a signal won't 
>>> they? Both intel_gt_wait_for_idle() and 
>>> intel_guc_wait_for_pending_msg(), which is where the 
>>> uc_wait_for_idle eventually ends up, have an 'if(signal_pending) 
>>> return -EINTR;' check. Beyond that, it sounds like what you are 
>>> asking for is a change in the IGT libraries and/or CI framework to 
>>> start sending signals after some specific timeout. That seems like a 
>>> significantly more complex change (in terms of the number of 
>>> entities affected and number of groups involved) and unnecessary.
>>
>> If you say so, I haven't looked at them all. But if the code path in 
>> question already aborts on signals then I am not sure what is the 
>> patch fixing? I assumed you are trying to avoid the write stuck in D 
>> forever, which then prevents driver unload and everything, requiring 
>> the test runner to eventually reboot. If you say SIGINT works then 
>> you can already recover from userspace, no?
>>
>>>> Whether or not 10s is enough CI will hopefully tell us. I'd 
>>>> probably err on the side of safety and make it longer, but at most 
>>>> half from the test runner timeout.
>>> This is supposed to be test clean up. This is not about how long a 
>>> particular test takes to complete but about how long it takes to 
>>> declare the system broken after the test has already finished. I 
>>> would argue that even 10s is massively longer than required.
>>>
>>>>
>>>> I am not convinced that wedging is correct though. Conceptually 
>>>> could be just that the timeout is too short. What does wedging 
>>>> really give us, on top of limiting the wait, when latter AFAIU is 
>>>> the key factor which would prevent the need to reboot the machine?
>>>>
>>> It gives us a system that knows what state it is in. If we can't 
>>> idle the GT then something is very badly wrong. Wedging indicates 
>>> that. It also ensure that a full GT reset will be attempted before 
>>> the next test is run. Helping to prevent a failure on test X from 
>>> propagating into failures of unrelated tests X+1, X+2, ... And if 
>>> the GT reset does not work because the system is really that badly 
>>> broken then future tests will not run rather than report erroneous 
>>> failures.
>>>
>>> This is not about getting a more stable system for end users by 
>>> sweeping issues under the carpet and pretending all is well. End 
>>> users don't run IGTs or explicitly call dodgy debugfs entry points. 
>>> The sole motivation here is to get more accurate results from CI. 
>>> That is, correctly identifying which test has hit a problem, getting 
>>> valid debug analysis for that test (logs and such) and allowing 
>>> further testing to complete correctly in the case where the system 
>>> can be recovered.
>>
>> I don't really oppose shortening of the timeout in principle, just 
>> want a clear statement if this is something IGT / test runner could 
>> already do or not. It can apply a timeout, it can also send SIGINT, 
>> and it could even trigger a reset from outside. Sure it is debugfs 
>> hacks so general "kernel should not implement policy" need not be 
>> strictly followed, but lets have it clear what are the options.
>
> One conceptual problem with applying this policy is that the code is:
>
>     if (val & (DROP_IDLE | DROP_ACTIVE)) {
>         ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
>         if (ret)
>             return ret;
>     }
>
>     if (val & DROP_IDLE) {
>         ret = intel_gt_pm_wait_for_idle(gt);
>         if (ret)
>             return ret;
>     }
>
> So if someone passes in DROP_IDLE and then why would only the first 
> branch have a short timeout and wedge. Yeah some bug happens to be 
> there at the moment, but put a bug in a different place and you hang 
> on the second branch and then need another patch. Versus perhaps 
> making it all respect SIGINT and handle from outside.
>
The pm_wait_for_idle is can only called after gt_wait_for_idle has 
completed successfully. There is no route to skip the GT idle or to do 
the PM idle even if the GT idle fails. So the chances of the PM idle 
failing are greatly reduced. There would have to be something outside of 
a GT keeping the GPU awake and there isn't a whole lot of hardware left 
at that point!

Regarding signals, the PM idle code ends up at 
wait_var_event_killable(). I assume that is interruptible via at least a 
KILL signal if not any signal. Although it's not entirely clear trying 
to follow through the implementation of this code. Also, I have no idea 
if there is a safe way to add a timeout to that code (or why it wasn't 
already written with a timeout included). Someone more familiar with the 
wakeref internals would need to comment.

However, I strongly disagree that we should not fix the driver just 
because it is possible to workaround the issue by re-writing the CI 
framework. Feel free to bring a redesign plan to the IGT WG and whatever 
equivalent CI meetings in parallel. But we absolutely should not have 
infinite waits in the kernel if there is a trivial way to not have 
infinite waits.

Also, sending a signal does not result in the wedge happening. I 
specifically did not want to change that code path because I was 
assuming there was a valid reason for it. If you have been interrupted 
then you are in the territory of maybe it would have succeeded if you 
just left it for a moment longer. Whereas, hitting the timeout says that 
someone very deliberately said this is too long to wait and therefore 
the system must be broken.

Plus, infinite wait is not a valid code path in the first place so any 
change in behaviour is not really a change in behaviour. Code can't be 
relying on a kernel call to never return for its correct operation!

And if you don't wedge then you don't recover. Each subsequent test 
would just hit the infinite timeout, get killed by the CI framework's 
shiny new kill signal and be marked as yet another unrelated bug that 
will be logged separately. Whereas, using a sensible timeout and then 
wedging will at least attempt to recover the situation. And if it can be 
recovered, future tests will pass. If it can't then future testing will 
be aborted.

John.


> Regards,
>
> Tvrtko
John Harrison Nov. 3, 2022, 7:37 p.m. UTC | #8
On 11/3/2022 02:18, Tvrtko Ursulin wrote:
> On 03/11/2022 01:33, John Harrison wrote:
>> On 11/2/2022 07:20, Tvrtko Ursulin wrote:
>>> On 02/11/2022 12:12, Jani Nikula wrote:
>>>> On Tue, 01 Nov 2022, John.C.Harrison@Intel.com wrote:
>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>
>>>>> At the end of each test, IGT does a drop caches call via sysfs with
>>>>
>>>> sysfs?
>> Sorry, that was meant to say debugfs. I've also been working on some 
>> sysfs IGT issues and evidently got my wires crossed!
>>
>>>>
>>>>> special flags set. One of the possible paths waits for idle with an
>>>>> infinite timeout. That causes problems for debugging issues when CI
>>>>> catches a "can't go idle" test failure. Best case, the CI system 
>>>>> times
>>>>> out (after 90s), attempts a bunch of state dump actions and then
>>>>> reboots the system to recover it. Worst case, the CI system can't do
>>>>> anything at all and then times out (after 1000s) and simply reboots.
>>>>> Sometimes a serial port log of dmesg might be available, sometimes 
>>>>> not.
>>>>>
>>>>> So rather than making life hard for ourselves, change the timeout to
>>>>> be 10s rather than infinite. Also, trigger the standard
>>>>> wedge/reset/recover sequence so that testing can continue with a
>>>>> working system (if possible).
>>>>>
>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>> ---
>>>>>   drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>>>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
>>>>> b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>> index ae987e92251dd..9d916fbbfc27c 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_debugfs.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>> @@ -641,6 +641,9 @@ DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>>>>>             DROP_RESET_ACTIVE | \
>>>>>             DROP_RESET_SEQNO | \
>>>>>             DROP_RCU)
>>>>> +
>>>>> +#define DROP_IDLE_TIMEOUT    (HZ * 10)
>>>>
>>>> I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also only 
>>>> used
>>>> here.
>>>
>>> So move here, dropping i915 prefix, next to the newly proposed one?
>> Sure, can do that.
>>
>>>
>>>> I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in
>>>> gt/intel_gt.c.
>>>
>>> Move there and rename to GT_IDLE_TIMEOUT?
>>>
>>>> I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in 
>>>> intel_gt_pm.c.
>>>
>>> No action needed, maybe drop i915 prefix if wanted.
>>>
>> These two are totally unrelated and in code not being touched by this 
>> change. I would rather not conflate changing random other things with 
>> fixing this specific issue.
>>
>>>> I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.
>>>
>>> Add _MS suffix if wanted.
>>>
>>>> My head spins.
>>>
>>> I follow and raise that the newly proposed DROP_IDLE_TIMEOUT applies 
>>> to DROP_ACTIVE and not only DROP_IDLE.
>> My original intention for the name was that is the 'drop caches 
>> timeout for intel_gt_wait_for_idle'. Which is quite the mouthful and 
>> hence abbreviated to DROP_IDLE_TIMEOUT. But yes, I realised later 
>> that name can be conflated with the DROP_IDLE flag. Will rename.
>>
>>
>>>
>>> Things get refactored, code moves around, bits get left behind, who 
>>> knows. No reason to get too worked up. :) As long as people are 
>>> taking a wider view when touching the code base, and are not afraid 
>>> to send cleanups, things should be good.
>> On the other hand, if every patch gets blocked in code review because 
>> someone points out some completely unrelated piece of code could be a 
>> bit better then nothing ever gets fixed. If you spot something that 
>> you think should be improved, isn't the general idea that you should 
>> post a patch yourself to improve it?
>
> There's two maintainers per branch and an order of magnitude or two 
> more developers so it'd be nice if cleanups would just be incoming on 
> self-initiative basis. ;)
It's not just maintainers that look at the code and spot problems. Where 
do you think patch set came from? It was not on my list of tasks to work 
on. No-one had logged this as a super urgent bug that needs to be fixed. 
I noticed a problem when trying to debug another issue and saw a way to 
improve the i915 debuggability. So I tried to fix it on a 
'self-initiative basis'. And already that trivial fix has ballooned into 
I don't know how many hours of work that has not been spent on doing the 
things I'm actually supposed to working on.

Likewise, the a bunch of other patches I have recently posted. They are 
just things I happened to spot and spontaneously decided to fix.

And if you don't have time to fix something yourself, you can always 
just log it as a piece of work that needs to be done and add it to the 
backlog of tasks. It will then get assigned to whoever actually has the 
time to do it according to how important it really is.

John.


>
>>> For the actual functional change at hand - it would be nice if code 
>>> paths in question could handle SIGINT and then we could punt the 
>>> decision on how long someone wants to wait purely to userspace. But 
>>> it's probably hard and it's only debugfs so whatever.
>>>
>> The code paths in question will already abort on a signal won't they? 
>> Both intel_gt_wait_for_idle() and intel_guc_wait_for_pending_msg(), 
>> which is where the uc_wait_for_idle eventually ends up, have an 
>> 'if(signal_pending) return -EINTR;' check. Beyond that, it sounds 
>> like what you are asking for is a change in the IGT libraries and/or 
>> CI framework to start sending signals after some specific timeout. 
>> That seems like a significantly more complex change (in terms of the 
>> number of entities affected and number of groups involved) and 
>> unnecessary.
>
> If you say so, I haven't looked at them all. But if the code path in 
> question already aborts on signals then I am not sure what is the 
> patch fixing? I assumed you are trying to avoid the write stuck in D 
> forever, which then prevents driver unload and everything, requiring 
> the test runner to eventually reboot. If you say SIGINT works then you 
> can already recover from userspace, no?
>
>>> Whether or not 10s is enough CI will hopefully tell us. I'd probably 
>>> err on the side of safety and make it longer, but at most half from 
>>> the test runner timeout.
>> This is supposed to be test clean up. This is not about how long a 
>> particular test takes to complete but about how long it takes to 
>> declare the system broken after the test has already finished. I 
>> would argue that even 10s is massively longer than required.
>>
>>>
>>> I am not convinced that wedging is correct though. Conceptually 
>>> could be just that the timeout is too short. What does wedging 
>>> really give us, on top of limiting the wait, when latter AFAIU is 
>>> the key factor which would prevent the need to reboot the machine?
>>>
>> It gives us a system that knows what state it is in. If we can't idle 
>> the GT then something is very badly wrong. Wedging indicates that. It 
>> also ensure that a full GT reset will be attempted before the next 
>> test is run. Helping to prevent a failure on test X from propagating 
>> into failures of unrelated tests X+1, X+2, ... And if the GT reset 
>> does not work because the system is really that badly broken then 
>> future tests will not run rather than report erroneous failures.
>>
>> This is not about getting a more stable system for end users by 
>> sweeping issues under the carpet and pretending all is well. End 
>> users don't run IGTs or explicitly call dodgy debugfs entry points. 
>> The sole motivation here is to get more accurate results from CI. 
>> That is, correctly identifying which test has hit a problem, getting 
>> valid debug analysis for that test (logs and such) and allowing 
>> further testing to complete correctly in the case where the system 
>> can be recovered.
>
> I don't really oppose shortening of the timeout in principle, just 
> want a clear statement if this is something IGT / test runner could 
> already do or not. It can apply a timeout, it can also send SIGINT, 
> and it could even trigger a reset from outside. Sure it is debugfs 
> hacks so general "kernel should not implement policy" need not be 
> strictly followed, but lets have it clear what are the options.
>
> Regards,
>
> Tvrtko
John Harrison Nov. 3, 2022, 7:39 p.m. UTC | #9
On 11/3/2022 03:45, Jani Nikula wrote:
> On Wed, 02 Nov 2022, John Harrison <john.c.harrison@intel.com> wrote:
>> On 11/2/2022 07:20, Tvrtko Ursulin wrote:
>>> On 02/11/2022 12:12, Jani Nikula wrote:
>>>> On Tue, 01 Nov 2022, John.C.Harrison@Intel.com wrote:
>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>
>>>>> At the end of each test, IGT does a drop caches call via sysfs with
>>>> sysfs?
>> Sorry, that was meant to say debugfs. I've also been working on some
>> sysfs IGT issues and evidently got my wires crossed!
>>
>>>>> special flags set. One of the possible paths waits for idle with an
>>>>> infinite timeout. That causes problems for debugging issues when CI
>>>>> catches a "can't go idle" test failure. Best case, the CI system times
>>>>> out (after 90s), attempts a bunch of state dump actions and then
>>>>> reboots the system to recover it. Worst case, the CI system can't do
>>>>> anything at all and then times out (after 1000s) and simply reboots.
>>>>> Sometimes a serial port log of dmesg might be available, sometimes not.
>>>>>
>>>>> So rather than making life hard for ourselves, change the timeout to
>>>>> be 10s rather than infinite. Also, trigger the standard
>>>>> wedge/reset/recover sequence so that testing can continue with a
>>>>> working system (if possible).
>>>>>
>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>> ---
>>>>>    drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>>>>>    1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c
>>>>> b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>> index ae987e92251dd..9d916fbbfc27c 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_debugfs.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>> @@ -641,6 +641,9 @@ DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>>>>>              DROP_RESET_ACTIVE | \
>>>>>              DROP_RESET_SEQNO | \
>>>>>              DROP_RCU)
>>>>> +
>>>>> +#define DROP_IDLE_TIMEOUT    (HZ * 10)
>>>> I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also only used
>>>> here.
>>> So move here, dropping i915 prefix, next to the newly proposed one?
>> Sure, can do that.
>>
>>>> I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in
>>>> gt/intel_gt.c.
>>> Move there and rename to GT_IDLE_TIMEOUT?
>>>
>>>> I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in intel_gt_pm.c.
>>> No action needed, maybe drop i915 prefix if wanted.
>>>
>> These two are totally unrelated and in code not being touched by this
>> change. I would rather not conflate changing random other things with
>> fixing this specific issue.
>>
>>>> I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.
>>> Add _MS suffix if wanted.
>>>
>>>> My head spins.
>>> I follow and raise that the newly proposed DROP_IDLE_TIMEOUT applies
>>> to DROP_ACTIVE and not only DROP_IDLE.
>> My original intention for the name was that is the 'drop caches timeout
>> for intel_gt_wait_for_idle'. Which is quite the mouthful and hence
>> abbreviated to DROP_IDLE_TIMEOUT. But yes, I realised later that name
>> can be conflated with the DROP_IDLE flag. Will rename.
>>
>>
>>> Things get refactored, code moves around, bits get left behind, who
>>> knows. No reason to get too worked up. :) As long as people are taking
>>> a wider view when touching the code base, and are not afraid to send
>>> cleanups, things should be good.
>> On the other hand, if every patch gets blocked in code review because
>> someone points out some completely unrelated piece of code could be a
>> bit better then nothing ever gets fixed. If you spot something that you
>> think should be improved, isn't the general idea that you should post a
>> patch yourself to improve it?
> The general idea is that every change should improve the driver. If you
> need to modify something that's a mess, you fix the mess instead of
> adding to the mess. You can't put the onus on cleaning up after you on
> someone else.
Please point out in what way this patch is 'adding to the mess' or 
requiring some else to do additional 'cleaning up after'. As stated 
above, I have fixed up the issues pointed out which are related to the 
drop caches code. I don't agree that shoe-horning completely unrelated 
changes into random patches is a good thing. That makes it hard to work 
out what the patch is actually trying to do, it makes bisection more 
confusing, etc. Sure, maybe the unrelated change is trivial. But this 
change was supposed to be trivial too and already it has exploded into 
many hours of time spent not working on the things I am actually 
supposed to be working on.

> Sure, the patch at hand is neglible, but hey, so are the fixes.
So feel free to post a trivial patch to fix them. And if 'the patch at 
hand is negligible' then why is it generating so much discussion and 
argument over how it the problem should be fixed irrespective of adding 
in yet more unrelated changes?

John.

> BR,
> Jani.
>
>
>>
>>> For the actual functional change at hand - it would be nice if code
>>> paths in question could handle SIGINT and then we could punt the
>>> decision on how long someone wants to wait purely to userspace. But
>>> it's probably hard and it's only debugfs so whatever.
>>>
>> The code paths in question will already abort on a signal won't they?
>> Both intel_gt_wait_for_idle() and intel_guc_wait_for_pending_msg(),
>> which is where the uc_wait_for_idle eventually ends up, have an
>> 'if(signal_pending) return -EINTR;' check. Beyond that, it sounds like
>> what you are asking for is a change in the IGT libraries and/or CI
>> framework to start sending signals after some specific timeout. That
>> seems like a significantly more complex change (in terms of the number
>> of entities affected and number of groups involved) and unnecessary.
>>
>>> Whether or not 10s is enough CI will hopefully tell us. I'd probably
>>> err on the side of safety and make it longer, but at most half from
>>> the test runner timeout.
>> This is supposed to be test clean up. This is not about how long a
>> particular test takes to complete but about how long it takes to declare
>> the system broken after the test has already finished. I would argue
>> that even 10s is massively longer than required.
>>
>>> I am not convinced that wedging is correct though. Conceptually could
>>> be just that the timeout is too short. What does wedging really give
>>> us, on top of limiting the wait, when latter AFAIU is the key factor
>>> which would prevent the need to reboot the machine?
>>>
>> It gives us a system that knows what state it is in. If we can't idle
>> the GT then something is very badly wrong. Wedging indicates that. It
>> also ensure that a full GT reset will be attempted before the next test
>> is run. Helping to prevent a failure on test X from propagating into
>> failures of unrelated tests X+1, X+2, ... And if the GT reset does not
>> work because the system is really that badly broken then future tests
>> will not run rather than report erroneous failures.
>>
>> This is not about getting a more stable system for end users by sweeping
>> issues under the carpet and pretending all is well. End users don't run
>> IGTs or explicitly call dodgy debugfs entry points. The sole motivation
>> here is to get more accurate results from CI. That is, correctly
>> identifying which test has hit a problem, getting valid debug analysis
>> for that test (logs and such) and allowing further testing to complete
>> correctly in the case where the system can be recovered.
>>
>> John.
>>
>>> Regards,
>>>
>>> Tvrtko
Tvrtko Ursulin Nov. 4, 2022, 10:01 a.m. UTC | #10
On 03/11/2022 19:16, John Harrison wrote:
> On 11/3/2022 02:38, Tvrtko Ursulin wrote:
>> On 03/11/2022 09:18, Tvrtko Ursulin wrote:
>>> On 03/11/2022 01:33, John Harrison wrote:
>>>> On 11/2/2022 07:20, Tvrtko Ursulin wrote:
>>>>> On 02/11/2022 12:12, Jani Nikula wrote:
>>>>>> On Tue, 01 Nov 2022, John.C.Harrison@Intel.com wrote:
>>>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>>>
>>>>>>> At the end of each test, IGT does a drop caches call via sysfs with
>>>>>>
>>>>>> sysfs?
>>>> Sorry, that was meant to say debugfs. I've also been working on some 
>>>> sysfs IGT issues and evidently got my wires crossed!
>>>>
>>>>>>
>>>>>>> special flags set. One of the possible paths waits for idle with an
>>>>>>> infinite timeout. That causes problems for debugging issues when CI
>>>>>>> catches a "can't go idle" test failure. Best case, the CI system 
>>>>>>> times
>>>>>>> out (after 90s), attempts a bunch of state dump actions and then
>>>>>>> reboots the system to recover it. Worst case, the CI system can't do
>>>>>>> anything at all and then times out (after 1000s) and simply reboots.
>>>>>>> Sometimes a serial port log of dmesg might be available, 
>>>>>>> sometimes not.
>>>>>>>
>>>>>>> So rather than making life hard for ourselves, change the timeout to
>>>>>>> be 10s rather than infinite. Also, trigger the standard
>>>>>>> wedge/reset/recover sequence so that testing can continue with a
>>>>>>> working system (if possible).
>>>>>>>
>>>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>>>> ---
>>>>>>>   drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>>>>>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
>>>>>>> b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>> index ae987e92251dd..9d916fbbfc27c 100644
>>>>>>> --- a/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>> @@ -641,6 +641,9 @@ 
>>>>>>> DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>>>>>>>             DROP_RESET_ACTIVE | \
>>>>>>>             DROP_RESET_SEQNO | \
>>>>>>>             DROP_RCU)
>>>>>>> +
>>>>>>> +#define DROP_IDLE_TIMEOUT    (HZ * 10)
>>>>>>
>>>>>> I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also only 
>>>>>> used
>>>>>> here.
>>>>>
>>>>> So move here, dropping i915 prefix, next to the newly proposed one?
>>>> Sure, can do that.
>>>>
>>>>>
>>>>>> I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in
>>>>>> gt/intel_gt.c.
>>>>>
>>>>> Move there and rename to GT_IDLE_TIMEOUT?
>>>>>
>>>>>> I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in 
>>>>>> intel_gt_pm.c.
>>>>>
>>>>> No action needed, maybe drop i915 prefix if wanted.
>>>>>
>>>> These two are totally unrelated and in code not being touched by 
>>>> this change. I would rather not conflate changing random other 
>>>> things with fixing this specific issue.
>>>>
>>>>>> I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.
>>>>>
>>>>> Add _MS suffix if wanted.
>>>>>
>>>>>> My head spins.
>>>>>
>>>>> I follow and raise that the newly proposed DROP_IDLE_TIMEOUT 
>>>>> applies to DROP_ACTIVE and not only DROP_IDLE.
>>>> My original intention for the name was that is the 'drop caches 
>>>> timeout for intel_gt_wait_for_idle'. Which is quite the mouthful and 
>>>> hence abbreviated to DROP_IDLE_TIMEOUT. But yes, I realised later 
>>>> that name can be conflated with the DROP_IDLE flag. Will rename.
>>>>
>>>>
>>>>>
>>>>> Things get refactored, code moves around, bits get left behind, who 
>>>>> knows. No reason to get too worked up. :) As long as people are 
>>>>> taking a wider view when touching the code base, and are not afraid 
>>>>> to send cleanups, things should be good.
>>>> On the other hand, if every patch gets blocked in code review 
>>>> because someone points out some completely unrelated piece of code 
>>>> could be a bit better then nothing ever gets fixed. If you spot 
>>>> something that you think should be improved, isn't the general idea 
>>>> that you should post a patch yourself to improve it?
>>>
>>> There's two maintainers per branch and an order of magnitude or two 
>>> more developers so it'd be nice if cleanups would just be incoming on 
>>> self-initiative basis. ;)
>>>
>>>>> For the actual functional change at hand - it would be nice if code 
>>>>> paths in question could handle SIGINT and then we could punt the 
>>>>> decision on how long someone wants to wait purely to userspace. But 
>>>>> it's probably hard and it's only debugfs so whatever.
>>>>>
>>>> The code paths in question will already abort on a signal won't 
>>>> they? Both intel_gt_wait_for_idle() and 
>>>> intel_guc_wait_for_pending_msg(), which is where the 
>>>> uc_wait_for_idle eventually ends up, have an 'if(signal_pending) 
>>>> return -EINTR;' check. Beyond that, it sounds like what you are 
>>>> asking for is a change in the IGT libraries and/or CI framework to 
>>>> start sending signals after some specific timeout. That seems like a 
>>>> significantly more complex change (in terms of the number of 
>>>> entities affected and number of groups involved) and unnecessary.
>>>
>>> If you say so, I haven't looked at them all. But if the code path in 
>>> question already aborts on signals then I am not sure what is the 
>>> patch fixing? I assumed you are trying to avoid the write stuck in D 
>>> forever, which then prevents driver unload and everything, requiring 
>>> the test runner to eventually reboot. If you say SIGINT works then 
>>> you can already recover from userspace, no?
>>>
>>>>> Whether or not 10s is enough CI will hopefully tell us. I'd 
>>>>> probably err on the side of safety and make it longer, but at most 
>>>>> half from the test runner timeout.
>>>> This is supposed to be test clean up. This is not about how long a 
>>>> particular test takes to complete but about how long it takes to 
>>>> declare the system broken after the test has already finished. I 
>>>> would argue that even 10s is massively longer than required.
>>>>
>>>>>
>>>>> I am not convinced that wedging is correct though. Conceptually 
>>>>> could be just that the timeout is too short. What does wedging 
>>>>> really give us, on top of limiting the wait, when latter AFAIU is 
>>>>> the key factor which would prevent the need to reboot the machine?
>>>>>
>>>> It gives us a system that knows what state it is in. If we can't 
>>>> idle the GT then something is very badly wrong. Wedging indicates 
>>>> that. It also ensure that a full GT reset will be attempted before 
>>>> the next test is run. Helping to prevent a failure on test X from 
>>>> propagating into failures of unrelated tests X+1, X+2, ... And if 
>>>> the GT reset does not work because the system is really that badly 
>>>> broken then future tests will not run rather than report erroneous 
>>>> failures.
>>>>
>>>> This is not about getting a more stable system for end users by 
>>>> sweeping issues under the carpet and pretending all is well. End 
>>>> users don't run IGTs or explicitly call dodgy debugfs entry points. 
>>>> The sole motivation here is to get more accurate results from CI. 
>>>> That is, correctly identifying which test has hit a problem, getting 
>>>> valid debug analysis for that test (logs and such) and allowing 
>>>> further testing to complete correctly in the case where the system 
>>>> can be recovered.
>>>
>>> I don't really oppose shortening of the timeout in principle, just 
>>> want a clear statement if this is something IGT / test runner could 
>>> already do or not. It can apply a timeout, it can also send SIGINT, 
>>> and it could even trigger a reset from outside. Sure it is debugfs 
>>> hacks so general "kernel should not implement policy" need not be 
>>> strictly followed, but lets have it clear what are the options.
>>
>> One conceptual problem with applying this policy is that the code is:
>>
>>     if (val & (DROP_IDLE | DROP_ACTIVE)) {
>>         ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
>>         if (ret)
>>             return ret;
>>     }
>>
>>     if (val & DROP_IDLE) {
>>         ret = intel_gt_pm_wait_for_idle(gt);
>>         if (ret)
>>             return ret;
>>     }
>>
>> So if someone passes in DROP_IDLE and then why would only the first 
>> branch have a short timeout and wedge. Yeah some bug happens to be 
>> there at the moment, but put a bug in a different place and you hang 
>> on the second branch and then need another patch. Versus perhaps 
>> making it all respect SIGINT and handle from outside.
>>
> The pm_wait_for_idle is can only called after gt_wait_for_idle has 
> completed successfully. There is no route to skip the GT idle or to do 
> the PM idle even if the GT idle fails. So the chances of the PM idle 
> failing are greatly reduced. There would have to be something outside of 
> a GT keeping the GPU awake and there isn't a whole lot of hardware left 
> at that point!

Well "greatly reduced" is beside my point. Point is today bug is here 
and we add a timeout, tomorrow bug is there and then the same dance. It 
can be just a sw bug which forgets to release the pm ref in some 
circumstances, doesn't really matter.

> Regarding signals, the PM idle code ends up at 
> wait_var_event_killable(). I assume that is interruptible via at least a 
> KILL signal if not any signal. Although it's not entirely clear trying 
> to follow through the implementation of this code. Also, I have no idea 
> if there is a safe way to add a timeout to that code (or why it wasn't 
> already written with a timeout included). Someone more familiar with the 
> wakeref internals would need to comment.
> 
> However, I strongly disagree that we should not fix the driver just 
> because it is possible to workaround the issue by re-writing the CI 
> framework. Feel free to bring a redesign plan to the IGT WG and whatever 
> equivalent CI meetings in parallel. But we absolutely should not have 
> infinite waits in the kernel if there is a trivial way to not have 
> infinite waits.

I thought I was clear that I am not really opposed to the timeout.

The rest of the paragraph I don't really care - point is moot because 
it's debugfs so we can do whatever, as long as it is not burdensome to 
i915, which this isn't. If either wasn't the case then we certainly 
wouldn't be adding any workarounds in the kernel if it can be achieved 
in IGT.

> Also, sending a signal does not result in the wedge happening. I 
> specifically did not want to change that code path because I was 
> assuming there was a valid reason for it. If you have been interrupted 
> then you are in the territory of maybe it would have succeeded if you 
> just left it for a moment longer. Whereas, hitting the timeout says that 
> someone very deliberately said this is too long to wait and therefore 
> the system must be broken.

I wanted to know specifically about wedging - why can't you wedge/reset 
from IGT if DROP_IDLE times out in quiescent or wherever, if that's what 
you say is the right thing? That's a policy decision so why would i915 
wedge if an arbitrary timeout expired? I915 is not controlling how much 
work there is outstanding at the point the IGT decides to call DROP_IDLE.

> Plus, infinite wait is not a valid code path in the first place so any 
> change in behaviour is not really a change in behaviour. Code can't be 
> relying on a kernel call to never return for its correct operation!

Why infinite wait wouldn't be valid? Then you better change the other 
one as well. ;P

Regards,

Tvrtko

> And if you don't wedge then you don't recover. Each subsequent test 
> would just hit the infinite timeout, get killed by the CI framework's 
> shiny new kill signal and be marked as yet another unrelated bug that 
> will be logged separately. Whereas, using a sensible timeout and then 
> wedging will at least attempt to recover the situation. And if it can be 
> recovered, future tests will pass. If it can't then future testing will 
> be aborted.
> 
> John.
> 
> 
>> Regards,
>>
>> Tvrtko
>
John Harrison Nov. 4, 2022, 5:45 p.m. UTC | #11
On 11/4/2022 03:01, Tvrtko Ursulin wrote:
> On 03/11/2022 19:16, John Harrison wrote:
>> On 11/3/2022 02:38, Tvrtko Ursulin wrote:
>>> On 03/11/2022 09:18, Tvrtko Ursulin wrote:
>>>> On 03/11/2022 01:33, John Harrison wrote:
>>>>> On 11/2/2022 07:20, Tvrtko Ursulin wrote:
>>>>>> On 02/11/2022 12:12, Jani Nikula wrote:
>>>>>>> On Tue, 01 Nov 2022, John.C.Harrison@Intel.com wrote:
>>>>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>>>>
>>>>>>>> At the end of each test, IGT does a drop caches call via sysfs 
>>>>>>>> with
>>>>>>>
>>>>>>> sysfs?
>>>>> Sorry, that was meant to say debugfs. I've also been working on 
>>>>> some sysfs IGT issues and evidently got my wires crossed!
>>>>>
>>>>>>>
>>>>>>>> special flags set. One of the possible paths waits for idle 
>>>>>>>> with an
>>>>>>>> infinite timeout. That causes problems for debugging issues 
>>>>>>>> when CI
>>>>>>>> catches a "can't go idle" test failure. Best case, the CI 
>>>>>>>> system times
>>>>>>>> out (after 90s), attempts a bunch of state dump actions and then
>>>>>>>> reboots the system to recover it. Worst case, the CI system 
>>>>>>>> can't do
>>>>>>>> anything at all and then times out (after 1000s) and simply 
>>>>>>>> reboots.
>>>>>>>> Sometimes a serial port log of dmesg might be available, 
>>>>>>>> sometimes not.
>>>>>>>>
>>>>>>>> So rather than making life hard for ourselves, change the 
>>>>>>>> timeout to
>>>>>>>> be 10s rather than infinite. Also, trigger the standard
>>>>>>>> wedge/reset/recover sequence so that testing can continue with a
>>>>>>>> working system (if possible).
>>>>>>>>
>>>>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>>>>> ---
>>>>>>>>   drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>>>>>>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
>>>>>>>> b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>> index ae987e92251dd..9d916fbbfc27c 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>> @@ -641,6 +641,9 @@ 
>>>>>>>> DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>>>>>>>>             DROP_RESET_ACTIVE | \
>>>>>>>>             DROP_RESET_SEQNO | \
>>>>>>>>             DROP_RCU)
>>>>>>>> +
>>>>>>>> +#define DROP_IDLE_TIMEOUT    (HZ * 10)
>>>>>>>
>>>>>>> I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also 
>>>>>>> only used
>>>>>>> here.
>>>>>>
>>>>>> So move here, dropping i915 prefix, next to the newly proposed one?
>>>>> Sure, can do that.
>>>>>
>>>>>>
>>>>>>> I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in
>>>>>>> gt/intel_gt.c.
>>>>>>
>>>>>> Move there and rename to GT_IDLE_TIMEOUT?
>>>>>>
>>>>>>> I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in 
>>>>>>> intel_gt_pm.c.
>>>>>>
>>>>>> No action needed, maybe drop i915 prefix if wanted.
>>>>>>
>>>>> These two are totally unrelated and in code not being touched by 
>>>>> this change. I would rather not conflate changing random other 
>>>>> things with fixing this specific issue.
>>>>>
>>>>>>> I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.
>>>>>>
>>>>>> Add _MS suffix if wanted.
>>>>>>
>>>>>>> My head spins.
>>>>>>
>>>>>> I follow and raise that the newly proposed DROP_IDLE_TIMEOUT 
>>>>>> applies to DROP_ACTIVE and not only DROP_IDLE.
>>>>> My original intention for the name was that is the 'drop caches 
>>>>> timeout for intel_gt_wait_for_idle'. Which is quite the mouthful 
>>>>> and hence abbreviated to DROP_IDLE_TIMEOUT. But yes, I realised 
>>>>> later that name can be conflated with the DROP_IDLE flag. Will 
>>>>> rename.
>>>>>
>>>>>
>>>>>>
>>>>>> Things get refactored, code moves around, bits get left behind, 
>>>>>> who knows. No reason to get too worked up. :) As long as people 
>>>>>> are taking a wider view when touching the code base, and are not 
>>>>>> afraid to send cleanups, things should be good.
>>>>> On the other hand, if every patch gets blocked in code review 
>>>>> because someone points out some completely unrelated piece of code 
>>>>> could be a bit better then nothing ever gets fixed. If you spot 
>>>>> something that you think should be improved, isn't the general 
>>>>> idea that you should post a patch yourself to improve it?
>>>>
>>>> There's two maintainers per branch and an order of magnitude or two 
>>>> more developers so it'd be nice if cleanups would just be incoming 
>>>> on self-initiative basis. ;)
>>>>
>>>>>> For the actual functional change at hand - it would be nice if 
>>>>>> code paths in question could handle SIGINT and then we could punt 
>>>>>> the decision on how long someone wants to wait purely to 
>>>>>> userspace. But it's probably hard and it's only debugfs so whatever.
>>>>>>
>>>>> The code paths in question will already abort on a signal won't 
>>>>> they? Both intel_gt_wait_for_idle() and 
>>>>> intel_guc_wait_for_pending_msg(), which is where the 
>>>>> uc_wait_for_idle eventually ends up, have an 'if(signal_pending) 
>>>>> return -EINTR;' check. Beyond that, it sounds like what you are 
>>>>> asking for is a change in the IGT libraries and/or CI framework to 
>>>>> start sending signals after some specific timeout. That seems like 
>>>>> a significantly more complex change (in terms of the number of 
>>>>> entities affected and number of groups involved) and unnecessary.
>>>>
>>>> If you say so, I haven't looked at them all. But if the code path 
>>>> in question already aborts on signals then I am not sure what is 
>>>> the patch fixing? I assumed you are trying to avoid the write stuck 
>>>> in D forever, which then prevents driver unload and everything, 
>>>> requiring the test runner to eventually reboot. If you say SIGINT 
>>>> works then you can already recover from userspace, no?
>>>>
>>>>>> Whether or not 10s is enough CI will hopefully tell us. I'd 
>>>>>> probably err on the side of safety and make it longer, but at 
>>>>>> most half from the test runner timeout.
>>>>> This is supposed to be test clean up. This is not about how long a 
>>>>> particular test takes to complete but about how long it takes to 
>>>>> declare the system broken after the test has already finished. I 
>>>>> would argue that even 10s is massively longer than required.
>>>>>
>>>>>>
>>>>>> I am not convinced that wedging is correct though. Conceptually 
>>>>>> could be just that the timeout is too short. What does wedging 
>>>>>> really give us, on top of limiting the wait, when latter AFAIU is 
>>>>>> the key factor which would prevent the need to reboot the machine?
>>>>>>
>>>>> It gives us a system that knows what state it is in. If we can't 
>>>>> idle the GT then something is very badly wrong. Wedging indicates 
>>>>> that. It also ensure that a full GT reset will be attempted before 
>>>>> the next test is run. Helping to prevent a failure on test X from 
>>>>> propagating into failures of unrelated tests X+1, X+2, ... And if 
>>>>> the GT reset does not work because the system is really that badly 
>>>>> broken then future tests will not run rather than report erroneous 
>>>>> failures.
>>>>>
>>>>> This is not about getting a more stable system for end users by 
>>>>> sweeping issues under the carpet and pretending all is well. End 
>>>>> users don't run IGTs or explicitly call dodgy debugfs entry 
>>>>> points. The sole motivation here is to get more accurate results 
>>>>> from CI. That is, correctly identifying which test has hit a 
>>>>> problem, getting valid debug analysis for that test (logs and 
>>>>> such) and allowing further testing to complete correctly in the 
>>>>> case where the system can be recovered.
>>>>
>>>> I don't really oppose shortening of the timeout in principle, just 
>>>> want a clear statement if this is something IGT / test runner could 
>>>> already do or not. It can apply a timeout, it can also send SIGINT, 
>>>> and it could even trigger a reset from outside. Sure it is debugfs 
>>>> hacks so general "kernel should not implement policy" need not be 
>>>> strictly followed, but lets have it clear what are the options.
>>>
>>> One conceptual problem with applying this policy is that the code is:
>>>
>>>     if (val & (DROP_IDLE | DROP_ACTIVE)) {
>>>         ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
>>>         if (ret)
>>>             return ret;
>>>     }
>>>
>>>     if (val & DROP_IDLE) {
>>>         ret = intel_gt_pm_wait_for_idle(gt);
>>>         if (ret)
>>>             return ret;
>>>     }
>>>
>>> So if someone passes in DROP_IDLE and then why would only the first 
>>> branch have a short timeout and wedge. Yeah some bug happens to be 
>>> there at the moment, but put a bug in a different place and you hang 
>>> on the second branch and then need another patch. Versus perhaps 
>>> making it all respect SIGINT and handle from outside.
>>>
>> The pm_wait_for_idle is can only called after gt_wait_for_idle has 
>> completed successfully. There is no route to skip the GT idle or to 
>> do the PM idle even if the GT idle fails. So the chances of the PM 
>> idle failing are greatly reduced. There would have to be something 
>> outside of a GT keeping the GPU awake and there isn't a whole lot of 
>> hardware left at that point!
>
> Well "greatly reduced" is beside my point. Point is today bug is here 
> and we add a timeout, tomorrow bug is there and then the same dance. 
> It can be just a sw bug which forgets to release the pm ref in some 
> circumstances, doesn't really matter.
>
Huh?

Greatly reduced is the whole point. Today there is a bug and it causes a 
kernel hang which requires the CI framework to reboot the system in an 
extremely unfriendly way which makes it very hard to work out what 
happened. Logs are likely not available. We don't even necessarily know 
which test was being run at the time. Etc. So we replace the infinite 
timeout with a meaningful timeout. CI now correctly marks the single 
test as failing, captures all the correct logs, creates a useful bug 
report and continues on testing more stuff.

Sure, there is still the chance of hitting an infinite timeout. But that 
one is significantly more complicated to remove. And the chances of 
hitting that one are significantly smaller than the chances of hitting 
the first one.

So you are arguing that because I can't fix the last 0.1% of possible 
failures, I am not allowed to fix the first 99.9% of the failures?


>> Regarding signals, the PM idle code ends up at 
>> wait_var_event_killable(). I assume that is interruptible via at 
>> least a KILL signal if not any signal. Although it's not entirely 
>> clear trying to follow through the implementation of this code. Also, 
>> I have no idea if there is a safe way to add a timeout to that code 
>> (or why it wasn't already written with a timeout included). Someone 
>> more familiar with the wakeref internals would need to comment.
>>
>> However, I strongly disagree that we should not fix the driver just 
>> because it is possible to workaround the issue by re-writing the CI 
>> framework. Feel free to bring a redesign plan to the IGT WG and 
>> whatever equivalent CI meetings in parallel. But we absolutely should 
>> not have infinite waits in the kernel if there is a trivial way to 
>> not have infinite waits.
>
> I thought I was clear that I am not really opposed to the timeout.
>
> The rest of the paragraph I don't really care - point is moot because 
> it's debugfs so we can do whatever, as long as it is not burdensome to 
> i915, which this isn't. If either wasn't the case then we certainly 
> wouldn't be adding any workarounds in the kernel if it can be achieved 
> in IGT.
>
>> Also, sending a signal does not result in the wedge happening. I 
>> specifically did not want to change that code path because I was 
>> assuming there was a valid reason for it. If you have been 
>> interrupted then you are in the territory of maybe it would have 
>> succeeded if you just left it for a moment longer. Whereas, hitting 
>> the timeout says that someone very deliberately said this is too long 
>> to wait and therefore the system must be broken.
>
> I wanted to know specifically about wedging - why can't you 
> wedge/reset from IGT if DROP_IDLE times out in quiescent or wherever, 
> if that's what you say is the right thing? 
Huh?

DROP_IDLE has two waits. One that I am trying to change from infinite to 
finite + wedge. One that would take considerable effort to change and 
would be quite invasive to a lot more of the driver and which can only 
be hit if the first timeout actually completed successfully and is 
therefore of less importance anyway. Both of those time outs appear to 
respect signal interrupts.

> That's a policy decision so why would i915 wedge if an arbitrary 
> timeout expired? I915 is not controlling how much work there is 
> outstanding at the point the IGT decides to call DROP_IDLE.

Because this is a debug test interface that is used solely by IGT after 
it has finished its testing. This is not about wedging the device at 
some random arbitrary point because an AI compute workload takes three 
hours to complete. This is about a very specific test framework cleaning 
up after testing is completed and making sure the test did not fry the 
system.

And even if an IGT test was calling DROP_IDLE in the middle of a test 
for some reason, it should not be deliberately pushing 10+ seconds of 
work through and then calling a debug only interface to flush it out. If 
a test wants to verify that the system can cope with submitting a 
minutes worth of rendering and then waiting for it to complete then the 
test should be using official channels for that wait.

>
>> Plus, infinite wait is not a valid code path in the first place so 
>> any change in behaviour is not really a change in behaviour. Code 
>> can't be relying on a kernel call to never return for its correct 
>> operation!
>
> Why infinite wait wouldn't be valid? Then you better change the other 
> one as well. ;P
In what universe is it ever valid to wait forever for a test to complete?

See above, the PM code would require much more invasive changes. This 
was low hanging fruit. It was supposed to be a two minute change to a 
very self contained section of code that would provide significant 
benefit to debugging a small class of very hard to debug problems.

John.


>
> Regards,
>
> Tvrtko
>
>> And if you don't wedge then you don't recover. Each subsequent test 
>> would just hit the infinite timeout, get killed by the CI framework's 
>> shiny new kill signal and be marked as yet another unrelated bug that 
>> will be logged separately. Whereas, using a sensible timeout and then 
>> wedging will at least attempt to recover the situation. And if it can 
>> be recovered, future tests will pass. If it can't then future testing 
>> will be aborted.
>>
>> John.
>>
>>
>>> Regards,
>>>
>>> Tvrtko
>>
Tvrtko Ursulin Nov. 7, 2022, 2:09 p.m. UTC | #12
On 04/11/2022 17:45, John Harrison wrote:
> On 11/4/2022 03:01, Tvrtko Ursulin wrote:
>> On 03/11/2022 19:16, John Harrison wrote:
>>> On 11/3/2022 02:38, Tvrtko Ursulin wrote:
>>>> On 03/11/2022 09:18, Tvrtko Ursulin wrote:
>>>>> On 03/11/2022 01:33, John Harrison wrote:
>>>>>> On 11/2/2022 07:20, Tvrtko Ursulin wrote:
>>>>>>> On 02/11/2022 12:12, Jani Nikula wrote:
>>>>>>>> On Tue, 01 Nov 2022, John.C.Harrison@Intel.com wrote:
>>>>>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>>>>>
>>>>>>>>> At the end of each test, IGT does a drop caches call via sysfs 
>>>>>>>>> with
>>>>>>>>
>>>>>>>> sysfs?
>>>>>> Sorry, that was meant to say debugfs. I've also been working on 
>>>>>> some sysfs IGT issues and evidently got my wires crossed!
>>>>>>
>>>>>>>>
>>>>>>>>> special flags set. One of the possible paths waits for idle 
>>>>>>>>> with an
>>>>>>>>> infinite timeout. That causes problems for debugging issues 
>>>>>>>>> when CI
>>>>>>>>> catches a "can't go idle" test failure. Best case, the CI 
>>>>>>>>> system times
>>>>>>>>> out (after 90s), attempts a bunch of state dump actions and then
>>>>>>>>> reboots the system to recover it. Worst case, the CI system 
>>>>>>>>> can't do
>>>>>>>>> anything at all and then times out (after 1000s) and simply 
>>>>>>>>> reboots.
>>>>>>>>> Sometimes a serial port log of dmesg might be available, 
>>>>>>>>> sometimes not.
>>>>>>>>>
>>>>>>>>> So rather than making life hard for ourselves, change the 
>>>>>>>>> timeout to
>>>>>>>>> be 10s rather than infinite. Also, trigger the standard
>>>>>>>>> wedge/reset/recover sequence so that testing can continue with a
>>>>>>>>> working system (if possible).
>>>>>>>>>
>>>>>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>>>>>> ---
>>>>>>>>>   drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>>>>>>>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
>>>>>>>>> b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>> index ae987e92251dd..9d916fbbfc27c 100644
>>>>>>>>> --- a/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>> @@ -641,6 +641,9 @@ 
>>>>>>>>> DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>>>>>>>>>             DROP_RESET_ACTIVE | \
>>>>>>>>>             DROP_RESET_SEQNO | \
>>>>>>>>>             DROP_RCU)
>>>>>>>>> +
>>>>>>>>> +#define DROP_IDLE_TIMEOUT    (HZ * 10)
>>>>>>>>
>>>>>>>> I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also 
>>>>>>>> only used
>>>>>>>> here.
>>>>>>>
>>>>>>> So move here, dropping i915 prefix, next to the newly proposed one?
>>>>>> Sure, can do that.
>>>>>>
>>>>>>>
>>>>>>>> I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in
>>>>>>>> gt/intel_gt.c.
>>>>>>>
>>>>>>> Move there and rename to GT_IDLE_TIMEOUT?
>>>>>>>
>>>>>>>> I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in 
>>>>>>>> intel_gt_pm.c.
>>>>>>>
>>>>>>> No action needed, maybe drop i915 prefix if wanted.
>>>>>>>
>>>>>> These two are totally unrelated and in code not being touched by 
>>>>>> this change. I would rather not conflate changing random other 
>>>>>> things with fixing this specific issue.
>>>>>>
>>>>>>>> I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.
>>>>>>>
>>>>>>> Add _MS suffix if wanted.
>>>>>>>
>>>>>>>> My head spins.
>>>>>>>
>>>>>>> I follow and raise that the newly proposed DROP_IDLE_TIMEOUT 
>>>>>>> applies to DROP_ACTIVE and not only DROP_IDLE.
>>>>>> My original intention for the name was that is the 'drop caches 
>>>>>> timeout for intel_gt_wait_for_idle'. Which is quite the mouthful 
>>>>>> and hence abbreviated to DROP_IDLE_TIMEOUT. But yes, I realised 
>>>>>> later that name can be conflated with the DROP_IDLE flag. Will 
>>>>>> rename.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Things get refactored, code moves around, bits get left behind, 
>>>>>>> who knows. No reason to get too worked up. :) As long as people 
>>>>>>> are taking a wider view when touching the code base, and are not 
>>>>>>> afraid to send cleanups, things should be good.
>>>>>> On the other hand, if every patch gets blocked in code review 
>>>>>> because someone points out some completely unrelated piece of code 
>>>>>> could be a bit better then nothing ever gets fixed. If you spot 
>>>>>> something that you think should be improved, isn't the general 
>>>>>> idea that you should post a patch yourself to improve it?
>>>>>
>>>>> There's two maintainers per branch and an order of magnitude or two 
>>>>> more developers so it'd be nice if cleanups would just be incoming 
>>>>> on self-initiative basis. ;)
>>>>>
>>>>>>> For the actual functional change at hand - it would be nice if 
>>>>>>> code paths in question could handle SIGINT and then we could punt 
>>>>>>> the decision on how long someone wants to wait purely to 
>>>>>>> userspace. But it's probably hard and it's only debugfs so whatever.
>>>>>>>
>>>>>> The code paths in question will already abort on a signal won't 
>>>>>> they? Both intel_gt_wait_for_idle() and 
>>>>>> intel_guc_wait_for_pending_msg(), which is where the 
>>>>>> uc_wait_for_idle eventually ends up, have an 'if(signal_pending) 
>>>>>> return -EINTR;' check. Beyond that, it sounds like what you are 
>>>>>> asking for is a change in the IGT libraries and/or CI framework to 
>>>>>> start sending signals after some specific timeout. That seems like 
>>>>>> a significantly more complex change (in terms of the number of 
>>>>>> entities affected and number of groups involved) and unnecessary.
>>>>>
>>>>> If you say so, I haven't looked at them all. But if the code path 
>>>>> in question already aborts on signals then I am not sure what is 
>>>>> the patch fixing? I assumed you are trying to avoid the write stuck 
>>>>> in D forever, which then prevents driver unload and everything, 
>>>>> requiring the test runner to eventually reboot. If you say SIGINT 
>>>>> works then you can already recover from userspace, no?
>>>>>
>>>>>>> Whether or not 10s is enough CI will hopefully tell us. I'd 
>>>>>>> probably err on the side of safety and make it longer, but at 
>>>>>>> most half from the test runner timeout.
>>>>>> This is supposed to be test clean up. This is not about how long a 
>>>>>> particular test takes to complete but about how long it takes to 
>>>>>> declare the system broken after the test has already finished. I 
>>>>>> would argue that even 10s is massively longer than required.
>>>>>>
>>>>>>>
>>>>>>> I am not convinced that wedging is correct though. Conceptually 
>>>>>>> could be just that the timeout is too short. What does wedging 
>>>>>>> really give us, on top of limiting the wait, when latter AFAIU is 
>>>>>>> the key factor which would prevent the need to reboot the machine?
>>>>>>>
>>>>>> It gives us a system that knows what state it is in. If we can't 
>>>>>> idle the GT then something is very badly wrong. Wedging indicates 
>>>>>> that. It also ensure that a full GT reset will be attempted before 
>>>>>> the next test is run. Helping to prevent a failure on test X from 
>>>>>> propagating into failures of unrelated tests X+1, X+2, ... And if 
>>>>>> the GT reset does not work because the system is really that badly 
>>>>>> broken then future tests will not run rather than report erroneous 
>>>>>> failures.
>>>>>>
>>>>>> This is not about getting a more stable system for end users by 
>>>>>> sweeping issues under the carpet and pretending all is well. End 
>>>>>> users don't run IGTs or explicitly call dodgy debugfs entry 
>>>>>> points. The sole motivation here is to get more accurate results 
>>>>>> from CI. That is, correctly identifying which test has hit a 
>>>>>> problem, getting valid debug analysis for that test (logs and 
>>>>>> such) and allowing further testing to complete correctly in the 
>>>>>> case where the system can be recovered.
>>>>>
>>>>> I don't really oppose shortening of the timeout in principle, just 
>>>>> want a clear statement if this is something IGT / test runner could 
>>>>> already do or not. It can apply a timeout, it can also send SIGINT, 
>>>>> and it could even trigger a reset from outside. Sure it is debugfs 
>>>>> hacks so general "kernel should not implement policy" need not be 
>>>>> strictly followed, but lets have it clear what are the options.
>>>>
>>>> One conceptual problem with applying this policy is that the code is:
>>>>
>>>>     if (val & (DROP_IDLE | DROP_ACTIVE)) {
>>>>         ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
>>>>         if (ret)
>>>>             return ret;
>>>>     }
>>>>
>>>>     if (val & DROP_IDLE) {
>>>>         ret = intel_gt_pm_wait_for_idle(gt);
>>>>         if (ret)
>>>>             return ret;
>>>>     }
>>>>
>>>> So if someone passes in DROP_IDLE and then why would only the first 
>>>> branch have a short timeout and wedge. Yeah some bug happens to be 
>>>> there at the moment, but put a bug in a different place and you hang 
>>>> on the second branch and then need another patch. Versus perhaps 
>>>> making it all respect SIGINT and handle from outside.
>>>>
>>> The pm_wait_for_idle is can only called after gt_wait_for_idle has 
>>> completed successfully. There is no route to skip the GT idle or to 
>>> do the PM idle even if the GT idle fails. So the chances of the PM 
>>> idle failing are greatly reduced. There would have to be something 
>>> outside of a GT keeping the GPU awake and there isn't a whole lot of 
>>> hardware left at that point!
>>
>> Well "greatly reduced" is beside my point. Point is today bug is here 
>> and we add a timeout, tomorrow bug is there and then the same dance. 
>> It can be just a sw bug which forgets to release the pm ref in some 
>> circumstances, doesn't really matter.
>>
> Huh?
> 
> Greatly reduced is the whole point. Today there is a bug and it causes a 
> kernel hang which requires the CI framework to reboot the system in an 
> extremely unfriendly way which makes it very hard to work out what 
> happened. Logs are likely not available. We don't even necessarily know 
> which test was being run at the time. Etc. So we replace the infinite 
> timeout with a meaningful timeout. CI now correctly marks the single 
> test as failing, captures all the correct logs, creates a useful bug 
> report and continues on testing more stuff.

So what is preventing CI to collect logs if IGT is forever stuck in 
interruptible wait? Surely it can collect the logs at that point if the 
kernel is healthy enough. If it isn't then I don't see how wedging the 
GPU will make the kernel any healthier.

Is i915 preventing better log collection or could test runner be improved?

> Sure, there is still the chance of hitting an infinite timeout. But that 
> one is significantly more complicated to remove. And the chances of 
> hitting that one are significantly smaller than the chances of hitting 
> the first one.

This statement relies on intimate knowledge implementation details and a 
bit too much white box testing approach but that's okay, lets move past 
this one.

> So you are arguing that because I can't fix the last 0.1% of possible 
> failures, I am not allowed to fix the first 99.9% of the failures?

I am clearly not arguing for that. But we are also not talking about 
"fixing failures" here. Just how to make CI cope better with a class of 
i915 bugs.

>>> Regarding signals, the PM idle code ends up at 
>>> wait_var_event_killable(). I assume that is interruptible via at 
>>> least a KILL signal if not any signal. Although it's not entirely 
>>> clear trying to follow through the implementation of this code. Also, 
>>> I have no idea if there is a safe way to add a timeout to that code 
>>> (or why it wasn't already written with a timeout included). Someone 
>>> more familiar with the wakeref internals would need to comment.
>>>
>>> However, I strongly disagree that we should not fix the driver just 
>>> because it is possible to workaround the issue by re-writing the CI 
>>> framework. Feel free to bring a redesign plan to the IGT WG and 
>>> whatever equivalent CI meetings in parallel. But we absolutely should 
>>> not have infinite waits in the kernel if there is a trivial way to 
>>> not have infinite waits.
>>
>> I thought I was clear that I am not really opposed to the timeout.
>>
>> The rest of the paragraph I don't really care - point is moot because 
>> it's debugfs so we can do whatever, as long as it is not burdensome to 
>> i915, which this isn't. If either wasn't the case then we certainly 
>> wouldn't be adding any workarounds in the kernel if it can be achieved 
>> in IGT.
>>
>>> Also, sending a signal does not result in the wedge happening. I 
>>> specifically did not want to change that code path because I was 
>>> assuming there was a valid reason for it. If you have been 
>>> interrupted then you are in the territory of maybe it would have 
>>> succeeded if you just left it for a moment longer. Whereas, hitting 
>>> the timeout says that someone very deliberately said this is too long 
>>> to wait and therefore the system must be broken.
>>
>> I wanted to know specifically about wedging - why can't you 
>> wedge/reset from IGT if DROP_IDLE times out in quiescent or wherever, 
>> if that's what you say is the right thing? 
> Huh?
> 
> DROP_IDLE has two waits. One that I am trying to change from infinite to 
> finite + wedge. One that would take considerable effort to change and 
> would be quite invasive to a lot more of the driver and which can only 
> be hit if the first timeout actually completed successfully and is 
> therefore of less importance anyway. Both of those time outs appear to 
> respect signal interrupts.
> 
>> That's a policy decision so why would i915 wedge if an arbitrary 
>> timeout expired? I915 is not controlling how much work there is 
>> outstanding at the point the IGT decides to call DROP_IDLE.
> 
> Because this is a debug test interface that is used solely by IGT after 
> it has finished its testing. This is not about wedging the device at 
> some random arbitrary point because an AI compute workload takes three 
> hours to complete. This is about a very specific test framework cleaning 
> up after testing is completed and making sure the test did not fry the 
> system.
> 
> And even if an IGT test was calling DROP_IDLE in the middle of a test 
> for some reason, it should not be deliberately pushing 10+ seconds of 
> work through and then calling a debug only interface to flush it out. If 
> a test wants to verify that the system can cope with submitting a 
> minutes worth of rendering and then waiting for it to complete then the 
> test should be using official channels for that wait.
> 
>>
>>> Plus, infinite wait is not a valid code path in the first place so 
>>> any change in behaviour is not really a change in behaviour. Code 
>>> can't be relying on a kernel call to never return for its correct 
>>> operation!
>>
>> Why infinite wait wouldn't be valid? Then you better change the other 
>> one as well. ;P
> In what universe is it ever valid to wait forever for a test to complete?

Well above you claimed both paths respect SIGINT. If that is so then the 
wait is as infinite as the IGT wanted it to be.

> See above, the PM code would require much more invasive changes. This 
> was low hanging fruit. It was supposed to be a two minute change to a 
> very self contained section of code that would provide significant 
> benefit to debugging a small class of very hard to debug problems.

Sure, but I'd still like to know why can't you do what you want from the 
IGT framework.

Have the timeout reduction in i915, again that's fine assuming 10 
seconds it enough to not break something by accident.

With that change you already have broken the "infinite wait". It makes 
the debugfs write return -ETIME in time much shorter than the test 
runner timeout(s). What is the thing that you cannot do from IGT at that 
point is my question? You want to wedge then? Send DROP_RESET_ACTIVE to 
do it for you? If that doesn't work add a new flag which will wedge 
explicitly.

We are again degrading into a huge philosophical discussion and all I 
wanted to start with is to hear how exactly things go bad.

Regards,

Tvrtko
John Harrison Nov. 7, 2022, 7:45 p.m. UTC | #13
On 11/7/2022 06:09, Tvrtko Ursulin wrote:
> On 04/11/2022 17:45, John Harrison wrote:
>> On 11/4/2022 03:01, Tvrtko Ursulin wrote:
>>> On 03/11/2022 19:16, John Harrison wrote:
>>>> On 11/3/2022 02:38, Tvrtko Ursulin wrote:
>>>>> On 03/11/2022 09:18, Tvrtko Ursulin wrote:
>>>>>> On 03/11/2022 01:33, John Harrison wrote:
>>>>>>> On 11/2/2022 07:20, Tvrtko Ursulin wrote:
>>>>>>>> On 02/11/2022 12:12, Jani Nikula wrote:
>>>>>>>>> On Tue, 01 Nov 2022, John.C.Harrison@Intel.com wrote:
>>>>>>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>>>>>>
>>>>>>>>>> At the end of each test, IGT does a drop caches call via 
>>>>>>>>>> sysfs with
>>>>>>>>>
>>>>>>>>> sysfs?
>>>>>>> Sorry, that was meant to say debugfs. I've also been working on 
>>>>>>> some sysfs IGT issues and evidently got my wires crossed!
>>>>>>>
>>>>>>>>>
>>>>>>>>>> special flags set. One of the possible paths waits for idle 
>>>>>>>>>> with an
>>>>>>>>>> infinite timeout. That causes problems for debugging issues 
>>>>>>>>>> when CI
>>>>>>>>>> catches a "can't go idle" test failure. Best case, the CI 
>>>>>>>>>> system times
>>>>>>>>>> out (after 90s), attempts a bunch of state dump actions and then
>>>>>>>>>> reboots the system to recover it. Worst case, the CI system 
>>>>>>>>>> can't do
>>>>>>>>>> anything at all and then times out (after 1000s) and simply 
>>>>>>>>>> reboots.
>>>>>>>>>> Sometimes a serial port log of dmesg might be available, 
>>>>>>>>>> sometimes not.
>>>>>>>>>>
>>>>>>>>>> So rather than making life hard for ourselves, change the 
>>>>>>>>>> timeout to
>>>>>>>>>> be 10s rather than infinite. Also, trigger the standard
>>>>>>>>>> wedge/reset/recover sequence so that testing can continue with a
>>>>>>>>>> working system (if possible).
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>>>>>>> ---
>>>>>>>>>>   drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>>>>>>>>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
>>>>>>>>>> b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>>> index ae987e92251dd..9d916fbbfc27c 100644
>>>>>>>>>> --- a/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>>> @@ -641,6 +641,9 @@ 
>>>>>>>>>> DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>>>>>>>>>>             DROP_RESET_ACTIVE | \
>>>>>>>>>>             DROP_RESET_SEQNO | \
>>>>>>>>>>             DROP_RCU)
>>>>>>>>>> +
>>>>>>>>>> +#define DROP_IDLE_TIMEOUT    (HZ * 10)
>>>>>>>>>
>>>>>>>>> I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also 
>>>>>>>>> only used
>>>>>>>>> here.
>>>>>>>>
>>>>>>>> So move here, dropping i915 prefix, next to the newly proposed 
>>>>>>>> one?
>>>>>>> Sure, can do that.
>>>>>>>
>>>>>>>>
>>>>>>>>> I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in
>>>>>>>>> gt/intel_gt.c.
>>>>>>>>
>>>>>>>> Move there and rename to GT_IDLE_TIMEOUT?
>>>>>>>>
>>>>>>>>> I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in 
>>>>>>>>> intel_gt_pm.c.
>>>>>>>>
>>>>>>>> No action needed, maybe drop i915 prefix if wanted.
>>>>>>>>
>>>>>>> These two are totally unrelated and in code not being touched by 
>>>>>>> this change. I would rather not conflate changing random other 
>>>>>>> things with fixing this specific issue.
>>>>>>>
>>>>>>>>> I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.
>>>>>>>>
>>>>>>>> Add _MS suffix if wanted.
>>>>>>>>
>>>>>>>>> My head spins.
>>>>>>>>
>>>>>>>> I follow and raise that the newly proposed DROP_IDLE_TIMEOUT 
>>>>>>>> applies to DROP_ACTIVE and not only DROP_IDLE.
>>>>>>> My original intention for the name was that is the 'drop caches 
>>>>>>> timeout for intel_gt_wait_for_idle'. Which is quite the mouthful 
>>>>>>> and hence abbreviated to DROP_IDLE_TIMEOUT. But yes, I realised 
>>>>>>> later that name can be conflated with the DROP_IDLE flag. Will 
>>>>>>> rename.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Things get refactored, code moves around, bits get left behind, 
>>>>>>>> who knows. No reason to get too worked up. :) As long as people 
>>>>>>>> are taking a wider view when touching the code base, and are 
>>>>>>>> not afraid to send cleanups, things should be good.
>>>>>>> On the other hand, if every patch gets blocked in code review 
>>>>>>> because someone points out some completely unrelated piece of 
>>>>>>> code could be a bit better then nothing ever gets fixed. If you 
>>>>>>> spot something that you think should be improved, isn't the 
>>>>>>> general idea that you should post a patch yourself to improve it?
>>>>>>
>>>>>> There's two maintainers per branch and an order of magnitude or 
>>>>>> two more developers so it'd be nice if cleanups would just be 
>>>>>> incoming on self-initiative basis. ;)
>>>>>>
>>>>>>>> For the actual functional change at hand - it would be nice if 
>>>>>>>> code paths in question could handle SIGINT and then we could 
>>>>>>>> punt the decision on how long someone wants to wait purely to 
>>>>>>>> userspace. But it's probably hard and it's only debugfs so 
>>>>>>>> whatever.
>>>>>>>>
>>>>>>> The code paths in question will already abort on a signal won't 
>>>>>>> they? Both intel_gt_wait_for_idle() and 
>>>>>>> intel_guc_wait_for_pending_msg(), which is where the 
>>>>>>> uc_wait_for_idle eventually ends up, have an 'if(signal_pending) 
>>>>>>> return -EINTR;' check. Beyond that, it sounds like what you are 
>>>>>>> asking for is a change in the IGT libraries and/or CI framework 
>>>>>>> to start sending signals after some specific timeout. That seems 
>>>>>>> like a significantly more complex change (in terms of the number 
>>>>>>> of entities affected and number of groups involved) and 
>>>>>>> unnecessary.
>>>>>>
>>>>>> If you say so, I haven't looked at them all. But if the code path 
>>>>>> in question already aborts on signals then I am not sure what is 
>>>>>> the patch fixing? I assumed you are trying to avoid the write 
>>>>>> stuck in D forever, which then prevents driver unload and 
>>>>>> everything, requiring the test runner to eventually reboot. If 
>>>>>> you say SIGINT works then you can already recover from userspace, 
>>>>>> no?
>>>>>>
>>>>>>>> Whether or not 10s is enough CI will hopefully tell us. I'd 
>>>>>>>> probably err on the side of safety and make it longer, but at 
>>>>>>>> most half from the test runner timeout.
>>>>>>> This is supposed to be test clean up. This is not about how long 
>>>>>>> a particular test takes to complete but about how long it takes 
>>>>>>> to declare the system broken after the test has already 
>>>>>>> finished. I would argue that even 10s is massively longer than 
>>>>>>> required.
>>>>>>>
>>>>>>>>
>>>>>>>> I am not convinced that wedging is correct though. Conceptually 
>>>>>>>> could be just that the timeout is too short. What does wedging 
>>>>>>>> really give us, on top of limiting the wait, when latter AFAIU 
>>>>>>>> is the key factor which would prevent the need to reboot the 
>>>>>>>> machine?
>>>>>>>>
>>>>>>> It gives us a system that knows what state it is in. If we can't 
>>>>>>> idle the GT then something is very badly wrong. Wedging 
>>>>>>> indicates that. It also ensure that a full GT reset will be 
>>>>>>> attempted before the next test is run. Helping to prevent a 
>>>>>>> failure on test X from propagating into failures of unrelated 
>>>>>>> tests X+1, X+2, ... And if the GT reset does not work because 
>>>>>>> the system is really that badly broken then future tests will 
>>>>>>> not run rather than report erroneous failures.
>>>>>>>
>>>>>>> This is not about getting a more stable system for end users by 
>>>>>>> sweeping issues under the carpet and pretending all is well. End 
>>>>>>> users don't run IGTs or explicitly call dodgy debugfs entry 
>>>>>>> points. The sole motivation here is to get more accurate results 
>>>>>>> from CI. That is, correctly identifying which test has hit a 
>>>>>>> problem, getting valid debug analysis for that test (logs and 
>>>>>>> such) and allowing further testing to complete correctly in the 
>>>>>>> case where the system can be recovered.
>>>>>>
>>>>>> I don't really oppose shortening of the timeout in principle, 
>>>>>> just want a clear statement if this is something IGT / test 
>>>>>> runner could already do or not. It can apply a timeout, it can 
>>>>>> also send SIGINT, and it could even trigger a reset from outside. 
>>>>>> Sure it is debugfs hacks so general "kernel should not implement 
>>>>>> policy" need not be strictly followed, but lets have it clear 
>>>>>> what are the options.
>>>>>
>>>>> One conceptual problem with applying this policy is that the code is:
>>>>>
>>>>>     if (val & (DROP_IDLE | DROP_ACTIVE)) {
>>>>>         ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
>>>>>         if (ret)
>>>>>             return ret;
>>>>>     }
>>>>>
>>>>>     if (val & DROP_IDLE) {
>>>>>         ret = intel_gt_pm_wait_for_idle(gt);
>>>>>         if (ret)
>>>>>             return ret;
>>>>>     }
>>>>>
>>>>> So if someone passes in DROP_IDLE and then why would only the 
>>>>> first branch have a short timeout and wedge. Yeah some bug happens 
>>>>> to be there at the moment, but put a bug in a different place and 
>>>>> you hang on the second branch and then need another patch. Versus 
>>>>> perhaps making it all respect SIGINT and handle from outside.
>>>>>
>>>> The pm_wait_for_idle is can only called after gt_wait_for_idle has 
>>>> completed successfully. There is no route to skip the GT idle or to 
>>>> do the PM idle even if the GT idle fails. So the chances of the PM 
>>>> idle failing are greatly reduced. There would have to be something 
>>>> outside of a GT keeping the GPU awake and there isn't a whole lot 
>>>> of hardware left at that point!
>>>
>>> Well "greatly reduced" is beside my point. Point is today bug is 
>>> here and we add a timeout, tomorrow bug is there and then the same 
>>> dance. It can be just a sw bug which forgets to release the pm ref 
>>> in some circumstances, doesn't really matter.
>>>
>> Huh?
>>
>> Greatly reduced is the whole point. Today there is a bug and it 
>> causes a kernel hang which requires the CI framework to reboot the 
>> system in an extremely unfriendly way which makes it very hard to 
>> work out what happened. Logs are likely not available. We don't even 
>> necessarily know which test was being run at the time. Etc. So we 
>> replace the infinite timeout with a meaningful timeout. CI now 
>> correctly marks the single test as failing, captures all the correct 
>> logs, creates a useful bug report and continues on testing more stuff.
>
> So what is preventing CI to collect logs if IGT is forever stuck in 
> interruptible wait? Surely it can collect the logs at that point if 
> the kernel is healthy enough. If it isn't then I don't see how wedging 
> the GPU will make the kernel any healthier.
>
> Is i915 preventing better log collection or could test runner be 
> improved?
>
>> Sure, there is still the chance of hitting an infinite timeout. But 
>> that one is significantly more complicated to remove. And the chances 
>> of hitting that one are significantly smaller than the chances of 
>> hitting the first one.
>
> This statement relies on intimate knowledge implementation details and 
> a bit too much white box testing approach but that's okay, lets move 
> past this one.
>
>> So you are arguing that because I can't fix the last 0.1% of possible 
>> failures, I am not allowed to fix the first 99.9% of the failures?
>
> I am clearly not arguing for that. But we are also not talking about 
> "fixing failures" here. Just how to make CI cope better with a class 
> of i915 bugs.
>
>>>> Regarding signals, the PM idle code ends up at 
>>>> wait_var_event_killable(). I assume that is interruptible via at 
>>>> least a KILL signal if not any signal. Although it's not entirely 
>>>> clear trying to follow through the implementation of this code. 
>>>> Also, I have no idea if there is a safe way to add a timeout to 
>>>> that code (or why it wasn't already written with a timeout 
>>>> included). Someone more familiar with the wakeref internals would 
>>>> need to comment.
>>>>
>>>> However, I strongly disagree that we should not fix the driver just 
>>>> because it is possible to workaround the issue by re-writing the CI 
>>>> framework. Feel free to bring a redesign plan to the IGT WG and 
>>>> whatever equivalent CI meetings in parallel. But we absolutely 
>>>> should not have infinite waits in the kernel if there is a trivial 
>>>> way to not have infinite waits.
>>>
>>> I thought I was clear that I am not really opposed to the timeout.
>>>
>>> The rest of the paragraph I don't really care - point is moot 
>>> because it's debugfs so we can do whatever, as long as it is not 
>>> burdensome to i915, which this isn't. If either wasn't the case then 
>>> we certainly wouldn't be adding any workarounds in the kernel if it 
>>> can be achieved in IGT.
>>>
>>>> Also, sending a signal does not result in the wedge happening. I 
>>>> specifically did not want to change that code path because I was 
>>>> assuming there was a valid reason for it. If you have been 
>>>> interrupted then you are in the territory of maybe it would have 
>>>> succeeded if you just left it for a moment longer. Whereas, hitting 
>>>> the timeout says that someone very deliberately said this is too 
>>>> long to wait and therefore the system must be broken.
>>>
>>> I wanted to know specifically about wedging - why can't you 
>>> wedge/reset from IGT if DROP_IDLE times out in quiescent or 
>>> wherever, if that's what you say is the right thing? 
>> Huh?
>>
>> DROP_IDLE has two waits. One that I am trying to change from infinite 
>> to finite + wedge. One that would take considerable effort to change 
>> and would be quite invasive to a lot more of the driver and which can 
>> only be hit if the first timeout actually completed successfully and 
>> is therefore of less importance anyway. Both of those time outs 
>> appear to respect signal interrupts.
>>
>>> That's a policy decision so why would i915 wedge if an arbitrary 
>>> timeout expired? I915 is not controlling how much work there is 
>>> outstanding at the point the IGT decides to call DROP_IDLE.
>>
>> Because this is a debug test interface that is used solely by IGT 
>> after it has finished its testing. This is not about wedging the 
>> device at some random arbitrary point because an AI compute workload 
>> takes three hours to complete. This is about a very specific test 
>> framework cleaning up after testing is completed and making sure the 
>> test did not fry the system.
>>
>> And even if an IGT test was calling DROP_IDLE in the middle of a test 
>> for some reason, it should not be deliberately pushing 10+ seconds of 
>> work through and then calling a debug only interface to flush it out. 
>> If a test wants to verify that the system can cope with submitting a 
>> minutes worth of rendering and then waiting for it to complete then 
>> the test should be using official channels for that wait.
>>
>>>
>>>> Plus, infinite wait is not a valid code path in the first place so 
>>>> any change in behaviour is not really a change in behaviour. Code 
>>>> can't be relying on a kernel call to never return for its correct 
>>>> operation!
>>>
>>> Why infinite wait wouldn't be valid? Then you better change the 
>>> other one as well. ;P
>> In what universe is it ever valid to wait forever for a test to 
>> complete?
>
> Well above you claimed both paths respect SIGINT. If that is so then 
> the wait is as infinite as the IGT wanted it to be.
>
>> See above, the PM code would require much more invasive changes. This 
>> was low hanging fruit. It was supposed to be a two minute change to a 
>> very self contained section of code that would provide significant 
>> benefit to debugging a small class of very hard to debug problems.
>
> Sure, but I'd still like to know why can't you do what you want from 
> the IGT framework.
>
> Have the timeout reduction in i915, again that's fine assuming 10 
> seconds it enough to not break something by accident.
CI showed no regressions. And if someone does find a valid reason why a 
post test drop caches call should legitimately take a stupidly long time 
then it is easy to track back where the ETIME error came from and bump 
the timeout.

>
> With that change you already have broken the "infinite wait". It makes 
> the debugfs write return -ETIME in time much shorter than the test 
> runner timeout(s). What is the thing that you cannot do from IGT at 
> that point is my question? You want to wedge then? Send 
> DROP_RESET_ACTIVE to do it for you? If that doesn't work add a new 
> flag which will wedge explicitly.
>
> We are again degrading into a huge philosophical discussion and all I 
> wanted to start with is to hear how exactly things go bad.
>
I have no idea what you are wanting. I am trying to have a technical 
discussion about improving the stability of the driver during CI 
testing. I have no idea if you are arguing that this change is good, 
bad, broken, wrong direction or what.

Things go bad as explained in the commit message. The CI framework does 
not use signals. The IGT framework does not use signals. There is no 
watchdog that sends a TERM or KILL signal after a specified timeout. All 
that happens is the IGT sits there forever waiting for the drop caches 
IOCTL to return. The CI framework eventually gives up waiting for the 
test to complete and tries to recover. There are many different CI 
frameworks in use across Intel. Some timeout quickly, some timeout 
slowly. But basically, they all eventually give up and don't bother 
trying any kind of remedial action but just hit the reset button 
(sometimes by literally power cycling the DUT). As result, background 
processes that are saving dmesg, stdout, etc do not necessarily 
terminate cleanly. That results in logs that are at best truncated, at 
worst missing entirely. It also results in some frameworks aborting 
testing at that point. So no results are generated for all the other 
tests that have yet to be run. Some frameworks also run tests in 
batches. All they log is that something, somewhere in the batch died. So 
you don't even know which specific test actually hit the problem.

Can the CI frameworks be improved? Undoubtedly. In very many ways. Is 
that something we have the ability to do with a simple patch? No. Would 
re-writing the IGT framework to add watchdog mechanisms improve things? 
Yes. Can it be done with a simple patch? No. Would a simple patch to 
i915 significantly improve the situation? Yes. Will it solve every 
possible CI hang? No. Will it fix any actual end user visible bugs? No. 
Will it introduce any new bugs? No. Will it help us to debug at least 
some CI failures? Yes.

John.

> Regards,
>
> Tvrtko
Tvrtko Ursulin Nov. 8, 2022, 9:08 a.m. UTC | #14
On 07/11/2022 19:45, John Harrison wrote:
> On 11/7/2022 06:09, Tvrtko Ursulin wrote:
>> On 04/11/2022 17:45, John Harrison wrote:
>>> On 11/4/2022 03:01, Tvrtko Ursulin wrote:
>>>> On 03/11/2022 19:16, John Harrison wrote:
>>>>> On 11/3/2022 02:38, Tvrtko Ursulin wrote:
>>>>>> On 03/11/2022 09:18, Tvrtko Ursulin wrote:
>>>>>>> On 03/11/2022 01:33, John Harrison wrote:
>>>>>>>> On 11/2/2022 07:20, Tvrtko Ursulin wrote:
>>>>>>>>> On 02/11/2022 12:12, Jani Nikula wrote:
>>>>>>>>>> On Tue, 01 Nov 2022, John.C.Harrison@Intel.com wrote:
>>>>>>>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>>>>>>>
>>>>>>>>>>> At the end of each test, IGT does a drop caches call via 
>>>>>>>>>>> sysfs with
>>>>>>>>>>
>>>>>>>>>> sysfs?
>>>>>>>> Sorry, that was meant to say debugfs. I've also been working on 
>>>>>>>> some sysfs IGT issues and evidently got my wires crossed!
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> special flags set. One of the possible paths waits for idle 
>>>>>>>>>>> with an
>>>>>>>>>>> infinite timeout. That causes problems for debugging issues 
>>>>>>>>>>> when CI
>>>>>>>>>>> catches a "can't go idle" test failure. Best case, the CI 
>>>>>>>>>>> system times
>>>>>>>>>>> out (after 90s), attempts a bunch of state dump actions and then
>>>>>>>>>>> reboots the system to recover it. Worst case, the CI system 
>>>>>>>>>>> can't do
>>>>>>>>>>> anything at all and then times out (after 1000s) and simply 
>>>>>>>>>>> reboots.
>>>>>>>>>>> Sometimes a serial port log of dmesg might be available, 
>>>>>>>>>>> sometimes not.
>>>>>>>>>>>
>>>>>>>>>>> So rather than making life hard for ourselves, change the 
>>>>>>>>>>> timeout to
>>>>>>>>>>> be 10s rather than infinite. Also, trigger the standard
>>>>>>>>>>> wedge/reset/recover sequence so that testing can continue with a
>>>>>>>>>>> working system (if possible).
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>>>>>>>> ---
>>>>>>>>>>>   drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>>>>>>>>>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
>>>>>>>>>>> b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>>>> index ae987e92251dd..9d916fbbfc27c 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>>>> @@ -641,6 +641,9 @@ 
>>>>>>>>>>> DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>>>>>>>>>>>             DROP_RESET_ACTIVE | \
>>>>>>>>>>>             DROP_RESET_SEQNO | \
>>>>>>>>>>>             DROP_RCU)
>>>>>>>>>>> +
>>>>>>>>>>> +#define DROP_IDLE_TIMEOUT    (HZ * 10)
>>>>>>>>>>
>>>>>>>>>> I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also 
>>>>>>>>>> only used
>>>>>>>>>> here.
>>>>>>>>>
>>>>>>>>> So move here, dropping i915 prefix, next to the newly proposed 
>>>>>>>>> one?
>>>>>>>> Sure, can do that.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in
>>>>>>>>>> gt/intel_gt.c.
>>>>>>>>>
>>>>>>>>> Move there and rename to GT_IDLE_TIMEOUT?
>>>>>>>>>
>>>>>>>>>> I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in 
>>>>>>>>>> intel_gt_pm.c.
>>>>>>>>>
>>>>>>>>> No action needed, maybe drop i915 prefix if wanted.
>>>>>>>>>
>>>>>>>> These two are totally unrelated and in code not being touched by 
>>>>>>>> this change. I would rather not conflate changing random other 
>>>>>>>> things with fixing this specific issue.
>>>>>>>>
>>>>>>>>>> I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.
>>>>>>>>>
>>>>>>>>> Add _MS suffix if wanted.
>>>>>>>>>
>>>>>>>>>> My head spins.
>>>>>>>>>
>>>>>>>>> I follow and raise that the newly proposed DROP_IDLE_TIMEOUT 
>>>>>>>>> applies to DROP_ACTIVE and not only DROP_IDLE.
>>>>>>>> My original intention for the name was that is the 'drop caches 
>>>>>>>> timeout for intel_gt_wait_for_idle'. Which is quite the mouthful 
>>>>>>>> and hence abbreviated to DROP_IDLE_TIMEOUT. But yes, I realised 
>>>>>>>> later that name can be conflated with the DROP_IDLE flag. Will 
>>>>>>>> rename.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Things get refactored, code moves around, bits get left behind, 
>>>>>>>>> who knows. No reason to get too worked up. :) As long as people 
>>>>>>>>> are taking a wider view when touching the code base, and are 
>>>>>>>>> not afraid to send cleanups, things should be good.
>>>>>>>> On the other hand, if every patch gets blocked in code review 
>>>>>>>> because someone points out some completely unrelated piece of 
>>>>>>>> code could be a bit better then nothing ever gets fixed. If you 
>>>>>>>> spot something that you think should be improved, isn't the 
>>>>>>>> general idea that you should post a patch yourself to improve it?
>>>>>>>
>>>>>>> There's two maintainers per branch and an order of magnitude or 
>>>>>>> two more developers so it'd be nice if cleanups would just be 
>>>>>>> incoming on self-initiative basis. ;)
>>>>>>>
>>>>>>>>> For the actual functional change at hand - it would be nice if 
>>>>>>>>> code paths in question could handle SIGINT and then we could 
>>>>>>>>> punt the decision on how long someone wants to wait purely to 
>>>>>>>>> userspace. But it's probably hard and it's only debugfs so 
>>>>>>>>> whatever.
>>>>>>>>>
>>>>>>>> The code paths in question will already abort on a signal won't 
>>>>>>>> they? Both intel_gt_wait_for_idle() and 
>>>>>>>> intel_guc_wait_for_pending_msg(), which is where the 
>>>>>>>> uc_wait_for_idle eventually ends up, have an 'if(signal_pending) 
>>>>>>>> return -EINTR;' check. Beyond that, it sounds like what you are 
>>>>>>>> asking for is a change in the IGT libraries and/or CI framework 
>>>>>>>> to start sending signals after some specific timeout. That seems 
>>>>>>>> like a significantly more complex change (in terms of the number 
>>>>>>>> of entities affected and number of groups involved) and 
>>>>>>>> unnecessary.
>>>>>>>
>>>>>>> If you say so, I haven't looked at them all. But if the code path 
>>>>>>> in question already aborts on signals then I am not sure what is 
>>>>>>> the patch fixing? I assumed you are trying to avoid the write 
>>>>>>> stuck in D forever, which then prevents driver unload and 
>>>>>>> everything, requiring the test runner to eventually reboot. If 
>>>>>>> you say SIGINT works then you can already recover from userspace, 
>>>>>>> no?
>>>>>>>
>>>>>>>>> Whether or not 10s is enough CI will hopefully tell us. I'd 
>>>>>>>>> probably err on the side of safety and make it longer, but at 
>>>>>>>>> most half from the test runner timeout.
>>>>>>>> This is supposed to be test clean up. This is not about how long 
>>>>>>>> a particular test takes to complete but about how long it takes 
>>>>>>>> to declare the system broken after the test has already 
>>>>>>>> finished. I would argue that even 10s is massively longer than 
>>>>>>>> required.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I am not convinced that wedging is correct though. Conceptually 
>>>>>>>>> could be just that the timeout is too short. What does wedging 
>>>>>>>>> really give us, on top of limiting the wait, when latter AFAIU 
>>>>>>>>> is the key factor which would prevent the need to reboot the 
>>>>>>>>> machine?
>>>>>>>>>
>>>>>>>> It gives us a system that knows what state it is in. If we can't 
>>>>>>>> idle the GT then something is very badly wrong. Wedging 
>>>>>>>> indicates that. It also ensure that a full GT reset will be 
>>>>>>>> attempted before the next test is run. Helping to prevent a 
>>>>>>>> failure on test X from propagating into failures of unrelated 
>>>>>>>> tests X+1, X+2, ... And if the GT reset does not work because 
>>>>>>>> the system is really that badly broken then future tests will 
>>>>>>>> not run rather than report erroneous failures.
>>>>>>>>
>>>>>>>> This is not about getting a more stable system for end users by 
>>>>>>>> sweeping issues under the carpet and pretending all is well. End 
>>>>>>>> users don't run IGTs or explicitly call dodgy debugfs entry 
>>>>>>>> points. The sole motivation here is to get more accurate results 
>>>>>>>> from CI. That is, correctly identifying which test has hit a 
>>>>>>>> problem, getting valid debug analysis for that test (logs and 
>>>>>>>> such) and allowing further testing to complete correctly in the 
>>>>>>>> case where the system can be recovered.
>>>>>>>
>>>>>>> I don't really oppose shortening of the timeout in principle, 
>>>>>>> just want a clear statement if this is something IGT / test 
>>>>>>> runner could already do or not. It can apply a timeout, it can 
>>>>>>> also send SIGINT, and it could even trigger a reset from outside. 
>>>>>>> Sure it is debugfs hacks so general "kernel should not implement 
>>>>>>> policy" need not be strictly followed, but lets have it clear 
>>>>>>> what are the options.
>>>>>>
>>>>>> One conceptual problem with applying this policy is that the code is:
>>>>>>
>>>>>>     if (val & (DROP_IDLE | DROP_ACTIVE)) {
>>>>>>         ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
>>>>>>         if (ret)
>>>>>>             return ret;
>>>>>>     }
>>>>>>
>>>>>>     if (val & DROP_IDLE) {
>>>>>>         ret = intel_gt_pm_wait_for_idle(gt);
>>>>>>         if (ret)
>>>>>>             return ret;
>>>>>>     }
>>>>>>
>>>>>> So if someone passes in DROP_IDLE and then why would only the 
>>>>>> first branch have a short timeout and wedge. Yeah some bug happens 
>>>>>> to be there at the moment, but put a bug in a different place and 
>>>>>> you hang on the second branch and then need another patch. Versus 
>>>>>> perhaps making it all respect SIGINT and handle from outside.
>>>>>>
>>>>> The pm_wait_for_idle is can only called after gt_wait_for_idle has 
>>>>> completed successfully. There is no route to skip the GT idle or to 
>>>>> do the PM idle even if the GT idle fails. So the chances of the PM 
>>>>> idle failing are greatly reduced. There would have to be something 
>>>>> outside of a GT keeping the GPU awake and there isn't a whole lot 
>>>>> of hardware left at that point!
>>>>
>>>> Well "greatly reduced" is beside my point. Point is today bug is 
>>>> here and we add a timeout, tomorrow bug is there and then the same 
>>>> dance. It can be just a sw bug which forgets to release the pm ref 
>>>> in some circumstances, doesn't really matter.
>>>>
>>> Huh?
>>>
>>> Greatly reduced is the whole point. Today there is a bug and it 
>>> causes a kernel hang which requires the CI framework to reboot the 
>>> system in an extremely unfriendly way which makes it very hard to 
>>> work out what happened. Logs are likely not available. We don't even 
>>> necessarily know which test was being run at the time. Etc. So we 
>>> replace the infinite timeout with a meaningful timeout. CI now 
>>> correctly marks the single test as failing, captures all the correct 
>>> logs, creates a useful bug report and continues on testing more stuff.
>>
>> So what is preventing CI to collect logs if IGT is forever stuck in 
>> interruptible wait? Surely it can collect the logs at that point if 
>> the kernel is healthy enough. If it isn't then I don't see how wedging 
>> the GPU will make the kernel any healthier.
>>
>> Is i915 preventing better log collection or could test runner be 
>> improved?
>>
>>> Sure, there is still the chance of hitting an infinite timeout. But 
>>> that one is significantly more complicated to remove. And the chances 
>>> of hitting that one are significantly smaller than the chances of 
>>> hitting the first one.
>>
>> This statement relies on intimate knowledge implementation details and 
>> a bit too much white box testing approach but that's okay, lets move 
>> past this one.
>>
>>> So you are arguing that because I can't fix the last 0.1% of possible 
>>> failures, I am not allowed to fix the first 99.9% of the failures?
>>
>> I am clearly not arguing for that. But we are also not talking about 
>> "fixing failures" here. Just how to make CI cope better with a class 
>> of i915 bugs.
>>
>>>>> Regarding signals, the PM idle code ends up at 
>>>>> wait_var_event_killable(). I assume that is interruptible via at 
>>>>> least a KILL signal if not any signal. Although it's not entirely 
>>>>> clear trying to follow through the implementation of this code. 
>>>>> Also, I have no idea if there is a safe way to add a timeout to 
>>>>> that code (or why it wasn't already written with a timeout 
>>>>> included). Someone more familiar with the wakeref internals would 
>>>>> need to comment.
>>>>>
>>>>> However, I strongly disagree that we should not fix the driver just 
>>>>> because it is possible to workaround the issue by re-writing the CI 
>>>>> framework. Feel free to bring a redesign plan to the IGT WG and 
>>>>> whatever equivalent CI meetings in parallel. But we absolutely 
>>>>> should not have infinite waits in the kernel if there is a trivial 
>>>>> way to not have infinite waits.
>>>>
>>>> I thought I was clear that I am not really opposed to the timeout.
>>>>
>>>> The rest of the paragraph I don't really care - point is moot 
>>>> because it's debugfs so we can do whatever, as long as it is not 
>>>> burdensome to i915, which this isn't. If either wasn't the case then 
>>>> we certainly wouldn't be adding any workarounds in the kernel if it 
>>>> can be achieved in IGT.
>>>>
>>>>> Also, sending a signal does not result in the wedge happening. I 
>>>>> specifically did not want to change that code path because I was 
>>>>> assuming there was a valid reason for it. If you have been 
>>>>> interrupted then you are in the territory of maybe it would have 
>>>>> succeeded if you just left it for a moment longer. Whereas, hitting 
>>>>> the timeout says that someone very deliberately said this is too 
>>>>> long to wait and therefore the system must be broken.
>>>>
>>>> I wanted to know specifically about wedging - why can't you 
>>>> wedge/reset from IGT if DROP_IDLE times out in quiescent or 
>>>> wherever, if that's what you say is the right thing? 
>>> Huh?
>>>
>>> DROP_IDLE has two waits. One that I am trying to change from infinite 
>>> to finite + wedge. One that would take considerable effort to change 
>>> and would be quite invasive to a lot more of the driver and which can 
>>> only be hit if the first timeout actually completed successfully and 
>>> is therefore of less importance anyway. Both of those time outs 
>>> appear to respect signal interrupts.
>>>
>>>> That's a policy decision so why would i915 wedge if an arbitrary 
>>>> timeout expired? I915 is not controlling how much work there is 
>>>> outstanding at the point the IGT decides to call DROP_IDLE.
>>>
>>> Because this is a debug test interface that is used solely by IGT 
>>> after it has finished its testing. This is not about wedging the 
>>> device at some random arbitrary point because an AI compute workload 
>>> takes three hours to complete. This is about a very specific test 
>>> framework cleaning up after testing is completed and making sure the 
>>> test did not fry the system.
>>>
>>> And even if an IGT test was calling DROP_IDLE in the middle of a test 
>>> for some reason, it should not be deliberately pushing 10+ seconds of 
>>> work through and then calling a debug only interface to flush it out. 
>>> If a test wants to verify that the system can cope with submitting a 
>>> minutes worth of rendering and then waiting for it to complete then 
>>> the test should be using official channels for that wait.
>>>
>>>>
>>>>> Plus, infinite wait is not a valid code path in the first place so 
>>>>> any change in behaviour is not really a change in behaviour. Code 
>>>>> can't be relying on a kernel call to never return for its correct 
>>>>> operation!
>>>>
>>>> Why infinite wait wouldn't be valid? Then you better change the 
>>>> other one as well. ;P
>>> In what universe is it ever valid to wait forever for a test to 
>>> complete?
>>
>> Well above you claimed both paths respect SIGINT. If that is so then 
>> the wait is as infinite as the IGT wanted it to be.
>>
>>> See above, the PM code would require much more invasive changes. This 
>>> was low hanging fruit. It was supposed to be a two minute change to a 
>>> very self contained section of code that would provide significant 
>>> benefit to debugging a small class of very hard to debug problems.
>>
>> Sure, but I'd still like to know why can't you do what you want from 
>> the IGT framework.
>>
>> Have the timeout reduction in i915, again that's fine assuming 10 
>> seconds it enough to not break something by accident.
> CI showed no regressions. And if someone does find a valid reason why a 
> post test drop caches call should legitimately take a stupidly long time 
> then it is easy to track back where the ETIME error came from and bump 
> the timeout.
> 
>>
>> With that change you already have broken the "infinite wait". It makes 
>> the debugfs write return -ETIME in time much shorter than the test 
>> runner timeout(s). What is the thing that you cannot do from IGT at 
>> that point is my question? You want to wedge then? Send 
>> DROP_RESET_ACTIVE to do it for you? If that doesn't work add a new 
>> flag which will wedge explicitly.
>>
>> We are again degrading into a huge philosophical discussion and all I 
>> wanted to start with is to hear how exactly things go bad.
>>
> I have no idea what you are wanting. I am trying to have a technical 
> discussion about improving the stability of the driver during CI 
> testing. I have no idea if you are arguing that this change is good, 
> bad, broken, wrong direction or what.
> 
> Things go bad as explained in the commit message. The CI framework does 
> not use signals. The IGT framework does not use signals. There is no 
> watchdog that sends a TERM or KILL signal after a specified timeout. All 
> that happens is the IGT sits there forever waiting for the drop caches 
> IOCTL to return. The CI framework eventually gives up waiting for the 
> test to complete and tries to recover. There are many different CI 
> frameworks in use across Intel. Some timeout quickly, some timeout 
> slowly. But basically, they all eventually give up and don't bother 
> trying any kind of remedial action but just hit the reset button 
> (sometimes by literally power cycling the DUT). As result, background 
> processes that are saving dmesg, stdout, etc do not necessarily 
> terminate cleanly. That results in logs that are at best truncated, at 
> worst missing entirely. It also results in some frameworks aborting 
> testing at that point. So no results are generated for all the other 
> tests that have yet to be run. Some frameworks also run tests in 
> batches. All they log is that something, somewhere in the batch died. So 
> you don't even know which specific test actually hit the problem.
> 
> Can the CI frameworks be improved? Undoubtedly. In very many ways. Is 
> that something we have the ability to do with a simple patch? No. Would 
> re-writing the IGT framework to add watchdog mechanisms improve things? 
> Yes. Can it be done with a simple patch? No. Would a simple patch to 
> i915 significantly improve the situation? Yes. Will it solve every 
> possible CI hang? No. Will it fix any actual end user visible bugs? No. 
> Will it introduce any new bugs? No. Will it help us to debug at least 
> some CI failures? Yes.

To unblock, I suggest you go with the patch which caps the wait only, 
and propose a wedging as an IGT patch to gem_quiescent_gpu(). That 
should involve the CI/IGT folks into discussion on what logs will be, or 
will not be collected once gem_quiescent_gpu() fails due -ETIME. In fact 
probably you should copy CI/IGT folks on the v2 of the i915 patch as 
well since I now think their acks would be good to have - from the point 
of view of the current test runner behaviour with hanging tests.

Regards,

Tvrtko
John Harrison Nov. 8, 2022, 7:37 p.m. UTC | #15
On 11/8/2022 01:08, Tvrtko Ursulin wrote:
> On 07/11/2022 19:45, John Harrison wrote:
>> On 11/7/2022 06:09, Tvrtko Ursulin wrote:
>>> On 04/11/2022 17:45, John Harrison wrote:
>>>> On 11/4/2022 03:01, Tvrtko Ursulin wrote:
>>>>> On 03/11/2022 19:16, John Harrison wrote:
>>>>>> On 11/3/2022 02:38, Tvrtko Ursulin wrote:
>>>>>>> On 03/11/2022 09:18, Tvrtko Ursulin wrote:
>>>>>>>> On 03/11/2022 01:33, John Harrison wrote:
>>>>>>>>> On 11/2/2022 07:20, Tvrtko Ursulin wrote:
>>>>>>>>>> On 02/11/2022 12:12, Jani Nikula wrote:
>>>>>>>>>>> On Tue, 01 Nov 2022, John.C.Harrison@Intel.com wrote:
>>>>>>>>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>>>>>>>>
>>>>>>>>>>>> At the end of each test, IGT does a drop caches call via 
>>>>>>>>>>>> sysfs with
>>>>>>>>>>>
>>>>>>>>>>> sysfs?
>>>>>>>>> Sorry, that was meant to say debugfs. I've also been working 
>>>>>>>>> on some sysfs IGT issues and evidently got my wires crossed!
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> special flags set. One of the possible paths waits for idle 
>>>>>>>>>>>> with an
>>>>>>>>>>>> infinite timeout. That causes problems for debugging issues 
>>>>>>>>>>>> when CI
>>>>>>>>>>>> catches a "can't go idle" test failure. Best case, the CI 
>>>>>>>>>>>> system times
>>>>>>>>>>>> out (after 90s), attempts a bunch of state dump actions and 
>>>>>>>>>>>> then
>>>>>>>>>>>> reboots the system to recover it. Worst case, the CI system 
>>>>>>>>>>>> can't do
>>>>>>>>>>>> anything at all and then times out (after 1000s) and simply 
>>>>>>>>>>>> reboots.
>>>>>>>>>>>> Sometimes a serial port log of dmesg might be available, 
>>>>>>>>>>>> sometimes not.
>>>>>>>>>>>>
>>>>>>>>>>>> So rather than making life hard for ourselves, change the 
>>>>>>>>>>>> timeout to
>>>>>>>>>>>> be 10s rather than infinite. Also, trigger the standard
>>>>>>>>>>>> wedge/reset/recover sequence so that testing can continue 
>>>>>>>>>>>> with a
>>>>>>>>>>>> working system (if possible).
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>>>>>>>>> ---
>>>>>>>>>>>>   drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>>>>>>>>>>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>>>>>>>>
>>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
>>>>>>>>>>>> b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>>>>> index ae987e92251dd..9d916fbbfc27c 100644
>>>>>>>>>>>> --- a/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>>>>> @@ -641,6 +641,9 @@ 
>>>>>>>>>>>> DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>>>>>>>>>>>>             DROP_RESET_ACTIVE | \
>>>>>>>>>>>>             DROP_RESET_SEQNO | \
>>>>>>>>>>>>             DROP_RCU)
>>>>>>>>>>>> +
>>>>>>>>>>>> +#define DROP_IDLE_TIMEOUT    (HZ * 10)
>>>>>>>>>>>
>>>>>>>>>>> I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's 
>>>>>>>>>>> also only used
>>>>>>>>>>> here.
>>>>>>>>>>
>>>>>>>>>> So move here, dropping i915 prefix, next to the newly 
>>>>>>>>>> proposed one?
>>>>>>>>> Sure, can do that.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only 
>>>>>>>>>>> used in
>>>>>>>>>>> gt/intel_gt.c.
>>>>>>>>>>
>>>>>>>>>> Move there and rename to GT_IDLE_TIMEOUT?
>>>>>>>>>>
>>>>>>>>>>> I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in 
>>>>>>>>>>> intel_gt_pm.c.
>>>>>>>>>>
>>>>>>>>>> No action needed, maybe drop i915 prefix if wanted.
>>>>>>>>>>
>>>>>>>>> These two are totally unrelated and in code not being touched 
>>>>>>>>> by this change. I would rather not conflate changing random 
>>>>>>>>> other things with fixing this specific issue.
>>>>>>>>>
>>>>>>>>>>> I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.
>>>>>>>>>>
>>>>>>>>>> Add _MS suffix if wanted.
>>>>>>>>>>
>>>>>>>>>>> My head spins.
>>>>>>>>>>
>>>>>>>>>> I follow and raise that the newly proposed DROP_IDLE_TIMEOUT 
>>>>>>>>>> applies to DROP_ACTIVE and not only DROP_IDLE.
>>>>>>>>> My original intention for the name was that is the 'drop 
>>>>>>>>> caches timeout for intel_gt_wait_for_idle'. Which is quite the 
>>>>>>>>> mouthful and hence abbreviated to DROP_IDLE_TIMEOUT. But yes, 
>>>>>>>>> I realised later that name can be conflated with the DROP_IDLE 
>>>>>>>>> flag. Will rename.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Things get refactored, code moves around, bits get left 
>>>>>>>>>> behind, who knows. No reason to get too worked up. :) As long 
>>>>>>>>>> as people are taking a wider view when touching the code 
>>>>>>>>>> base, and are not afraid to send cleanups, things should be 
>>>>>>>>>> good.
>>>>>>>>> On the other hand, if every patch gets blocked in code review 
>>>>>>>>> because someone points out some completely unrelated piece of 
>>>>>>>>> code could be a bit better then nothing ever gets fixed. If 
>>>>>>>>> you spot something that you think should be improved, isn't 
>>>>>>>>> the general idea that you should post a patch yourself to 
>>>>>>>>> improve it?
>>>>>>>>
>>>>>>>> There's two maintainers per branch and an order of magnitude or 
>>>>>>>> two more developers so it'd be nice if cleanups would just be 
>>>>>>>> incoming on self-initiative basis. ;)
>>>>>>>>
>>>>>>>>>> For the actual functional change at hand - it would be nice 
>>>>>>>>>> if code paths in question could handle SIGINT and then we 
>>>>>>>>>> could punt the decision on how long someone wants to wait 
>>>>>>>>>> purely to userspace. But it's probably hard and it's only 
>>>>>>>>>> debugfs so whatever.
>>>>>>>>>>
>>>>>>>>> The code paths in question will already abort on a signal 
>>>>>>>>> won't they? Both intel_gt_wait_for_idle() and 
>>>>>>>>> intel_guc_wait_for_pending_msg(), which is where the 
>>>>>>>>> uc_wait_for_idle eventually ends up, have an 
>>>>>>>>> 'if(signal_pending) return -EINTR;' check. Beyond that, it 
>>>>>>>>> sounds like what you are asking for is a change in the IGT 
>>>>>>>>> libraries and/or CI framework to start sending signals after 
>>>>>>>>> some specific timeout. That seems like a significantly more 
>>>>>>>>> complex change (in terms of the number of entities affected 
>>>>>>>>> and number of groups involved) and unnecessary.
>>>>>>>>
>>>>>>>> If you say so, I haven't looked at them all. But if the code 
>>>>>>>> path in question already aborts on signals then I am not sure 
>>>>>>>> what is the patch fixing? I assumed you are trying to avoid the 
>>>>>>>> write stuck in D forever, which then prevents driver unload and 
>>>>>>>> everything, requiring the test runner to eventually reboot. If 
>>>>>>>> you say SIGINT works then you can already recover from 
>>>>>>>> userspace, no?
>>>>>>>>
>>>>>>>>>> Whether or not 10s is enough CI will hopefully tell us. I'd 
>>>>>>>>>> probably err on the side of safety and make it longer, but at 
>>>>>>>>>> most half from the test runner timeout.
>>>>>>>>> This is supposed to be test clean up. This is not about how 
>>>>>>>>> long a particular test takes to complete but about how long it 
>>>>>>>>> takes to declare the system broken after the test has already 
>>>>>>>>> finished. I would argue that even 10s is massively longer than 
>>>>>>>>> required.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I am not convinced that wedging is correct though. 
>>>>>>>>>> Conceptually could be just that the timeout is too short. 
>>>>>>>>>> What does wedging really give us, on top of limiting the 
>>>>>>>>>> wait, when latter AFAIU is the key factor which would prevent 
>>>>>>>>>> the need to reboot the machine?
>>>>>>>>>>
>>>>>>>>> It gives us a system that knows what state it is in. If we 
>>>>>>>>> can't idle the GT then something is very badly wrong. Wedging 
>>>>>>>>> indicates that. It also ensure that a full GT reset will be 
>>>>>>>>> attempted before the next test is run. Helping to prevent a 
>>>>>>>>> failure on test X from propagating into failures of unrelated 
>>>>>>>>> tests X+1, X+2, ... And if the GT reset does not work because 
>>>>>>>>> the system is really that badly broken then future tests will 
>>>>>>>>> not run rather than report erroneous failures.
>>>>>>>>>
>>>>>>>>> This is not about getting a more stable system for end users 
>>>>>>>>> by sweeping issues under the carpet and pretending all is 
>>>>>>>>> well. End users don't run IGTs or explicitly call dodgy 
>>>>>>>>> debugfs entry points. The sole motivation here is to get more 
>>>>>>>>> accurate results from CI. That is, correctly identifying which 
>>>>>>>>> test has hit a problem, getting valid debug analysis for that 
>>>>>>>>> test (logs and such) and allowing further testing to complete 
>>>>>>>>> correctly in the case where the system can be recovered.
>>>>>>>>
>>>>>>>> I don't really oppose shortening of the timeout in principle, 
>>>>>>>> just want a clear statement if this is something IGT / test 
>>>>>>>> runner could already do or not. It can apply a timeout, it can 
>>>>>>>> also send SIGINT, and it could even trigger a reset from 
>>>>>>>> outside. Sure it is debugfs hacks so general "kernel should not 
>>>>>>>> implement policy" need not be strictly followed, but lets have 
>>>>>>>> it clear what are the options.
>>>>>>>
>>>>>>> One conceptual problem with applying this policy is that the 
>>>>>>> code is:
>>>>>>>
>>>>>>>     if (val & (DROP_IDLE | DROP_ACTIVE)) {
>>>>>>>         ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
>>>>>>>         if (ret)
>>>>>>>             return ret;
>>>>>>>     }
>>>>>>>
>>>>>>>     if (val & DROP_IDLE) {
>>>>>>>         ret = intel_gt_pm_wait_for_idle(gt);
>>>>>>>         if (ret)
>>>>>>>             return ret;
>>>>>>>     }
>>>>>>>
>>>>>>> So if someone passes in DROP_IDLE and then why would only the 
>>>>>>> first branch have a short timeout and wedge. Yeah some bug 
>>>>>>> happens to be there at the moment, but put a bug in a different 
>>>>>>> place and you hang on the second branch and then need another 
>>>>>>> patch. Versus perhaps making it all respect SIGINT and handle 
>>>>>>> from outside.
>>>>>>>
>>>>>> The pm_wait_for_idle is can only called after gt_wait_for_idle 
>>>>>> has completed successfully. There is no route to skip the GT idle 
>>>>>> or to do the PM idle even if the GT idle fails. So the chances of 
>>>>>> the PM idle failing are greatly reduced. There would have to be 
>>>>>> something outside of a GT keeping the GPU awake and there isn't a 
>>>>>> whole lot of hardware left at that point!
>>>>>
>>>>> Well "greatly reduced" is beside my point. Point is today bug is 
>>>>> here and we add a timeout, tomorrow bug is there and then the same 
>>>>> dance. It can be just a sw bug which forgets to release the pm ref 
>>>>> in some circumstances, doesn't really matter.
>>>>>
>>>> Huh?
>>>>
>>>> Greatly reduced is the whole point. Today there is a bug and it 
>>>> causes a kernel hang which requires the CI framework to reboot the 
>>>> system in an extremely unfriendly way which makes it very hard to 
>>>> work out what happened. Logs are likely not available. We don't 
>>>> even necessarily know which test was being run at the time. Etc. So 
>>>> we replace the infinite timeout with a meaningful timeout. CI now 
>>>> correctly marks the single test as failing, captures all the 
>>>> correct logs, creates a useful bug report and continues on testing 
>>>> more stuff.
>>>
>>> So what is preventing CI to collect logs if IGT is forever stuck in 
>>> interruptible wait? Surely it can collect the logs at that point if 
>>> the kernel is healthy enough. If it isn't then I don't see how 
>>> wedging the GPU will make the kernel any healthier.
>>>
>>> Is i915 preventing better log collection or could test runner be 
>>> improved?
>>>
>>>> Sure, there is still the chance of hitting an infinite timeout. But 
>>>> that one is significantly more complicated to remove. And the 
>>>> chances of hitting that one are significantly smaller than the 
>>>> chances of hitting the first one.
>>>
>>> This statement relies on intimate knowledge implementation details 
>>> and a bit too much white box testing approach but that's okay, lets 
>>> move past this one.
>>>
>>>> So you are arguing that because I can't fix the last 0.1% of 
>>>> possible failures, I am not allowed to fix the first 99.9% of the 
>>>> failures?
>>>
>>> I am clearly not arguing for that. But we are also not talking about 
>>> "fixing failures" here. Just how to make CI cope better with a class 
>>> of i915 bugs.
>>>
>>>>>> Regarding signals, the PM idle code ends up at 
>>>>>> wait_var_event_killable(). I assume that is interruptible via at 
>>>>>> least a KILL signal if not any signal. Although it's not entirely 
>>>>>> clear trying to follow through the implementation of this code. 
>>>>>> Also, I have no idea if there is a safe way to add a timeout to 
>>>>>> that code (or why it wasn't already written with a timeout 
>>>>>> included). Someone more familiar with the wakeref internals would 
>>>>>> need to comment.
>>>>>>
>>>>>> However, I strongly disagree that we should not fix the driver 
>>>>>> just because it is possible to workaround the issue by re-writing 
>>>>>> the CI framework. Feel free to bring a redesign plan to the IGT 
>>>>>> WG and whatever equivalent CI meetings in parallel. But we 
>>>>>> absolutely should not have infinite waits in the kernel if there 
>>>>>> is a trivial way to not have infinite waits.
>>>>>
>>>>> I thought I was clear that I am not really opposed to the timeout.
>>>>>
>>>>> The rest of the paragraph I don't really care - point is moot 
>>>>> because it's debugfs so we can do whatever, as long as it is not 
>>>>> burdensome to i915, which this isn't. If either wasn't the case 
>>>>> then we certainly wouldn't be adding any workarounds in the kernel 
>>>>> if it can be achieved in IGT.
>>>>>
>>>>>> Also, sending a signal does not result in the wedge happening. I 
>>>>>> specifically did not want to change that code path because I was 
>>>>>> assuming there was a valid reason for it. If you have been 
>>>>>> interrupted then you are in the territory of maybe it would have 
>>>>>> succeeded if you just left it for a moment longer. Whereas, 
>>>>>> hitting the timeout says that someone very deliberately said this 
>>>>>> is too long to wait and therefore the system must be broken.
>>>>>
>>>>> I wanted to know specifically about wedging - why can't you 
>>>>> wedge/reset from IGT if DROP_IDLE times out in quiescent or 
>>>>> wherever, if that's what you say is the right thing? 
>>>> Huh?
>>>>
>>>> DROP_IDLE has two waits. One that I am trying to change from 
>>>> infinite to finite + wedge. One that would take considerable effort 
>>>> to change and would be quite invasive to a lot more of the driver 
>>>> and which can only be hit if the first timeout actually completed 
>>>> successfully and is therefore of less importance anyway. Both of 
>>>> those time outs appear to respect signal interrupts.
>>>>
>>>>> That's a policy decision so why would i915 wedge if an arbitrary 
>>>>> timeout expired? I915 is not controlling how much work there is 
>>>>> outstanding at the point the IGT decides to call DROP_IDLE.
>>>>
>>>> Because this is a debug test interface that is used solely by IGT 
>>>> after it has finished its testing. This is not about wedging the 
>>>> device at some random arbitrary point because an AI compute 
>>>> workload takes three hours to complete. This is about a very 
>>>> specific test framework cleaning up after testing is completed and 
>>>> making sure the test did not fry the system.
>>>>
>>>> And even if an IGT test was calling DROP_IDLE in the middle of a 
>>>> test for some reason, it should not be deliberately pushing 10+ 
>>>> seconds of work through and then calling a debug only interface to 
>>>> flush it out. If a test wants to verify that the system can cope 
>>>> with submitting a minutes worth of rendering and then waiting for 
>>>> it to complete then the test should be using official channels for 
>>>> that wait.
>>>>
>>>>>
>>>>>> Plus, infinite wait is not a valid code path in the first place 
>>>>>> so any change in behaviour is not really a change in behaviour. 
>>>>>> Code can't be relying on a kernel call to never return for its 
>>>>>> correct operation!
>>>>>
>>>>> Why infinite wait wouldn't be valid? Then you better change the 
>>>>> other one as well. ;P
>>>> In what universe is it ever valid to wait forever for a test to 
>>>> complete?
>>>
>>> Well above you claimed both paths respect SIGINT. If that is so then 
>>> the wait is as infinite as the IGT wanted it to be.
>>>
>>>> See above, the PM code would require much more invasive changes. 
>>>> This was low hanging fruit. It was supposed to be a two minute 
>>>> change to a very self contained section of code that would provide 
>>>> significant benefit to debugging a small class of very hard to 
>>>> debug problems.
>>>
>>> Sure, but I'd still like to know why can't you do what you want from 
>>> the IGT framework.
>>>
>>> Have the timeout reduction in i915, again that's fine assuming 10 
>>> seconds it enough to not break something by accident.
>> CI showed no regressions. And if someone does find a valid reason why 
>> a post test drop caches call should legitimately take a stupidly long 
>> time then it is easy to track back where the ETIME error came from 
>> and bump the timeout.
>>
>>>
>>> With that change you already have broken the "infinite wait". It 
>>> makes the debugfs write return -ETIME in time much shorter than the 
>>> test runner timeout(s). What is the thing that you cannot do from 
>>> IGT at that point is my question? You want to wedge then? Send 
>>> DROP_RESET_ACTIVE to do it for you? If that doesn't work add a new 
>>> flag which will wedge explicitly.
>>>
>>> We are again degrading into a huge philosophical discussion and all 
>>> I wanted to start with is to hear how exactly things go bad.
>>>
>> I have no idea what you are wanting. I am trying to have a technical 
>> discussion about improving the stability of the driver during CI 
>> testing. I have no idea if you are arguing that this change is good, 
>> bad, broken, wrong direction or what.
>>
>> Things go bad as explained in the commit message. The CI framework 
>> does not use signals. The IGT framework does not use signals. There 
>> is no watchdog that sends a TERM or KILL signal after a specified 
>> timeout. All that happens is the IGT sits there forever waiting for 
>> the drop caches IOCTL to return. The CI framework eventually gives up 
>> waiting for the test to complete and tries to recover. There are many 
>> different CI frameworks in use across Intel. Some timeout quickly, 
>> some timeout slowly. But basically, they all eventually give up and 
>> don't bother trying any kind of remedial action but just hit the 
>> reset button (sometimes by literally power cycling the DUT). As 
>> result, background processes that are saving dmesg, stdout, etc do 
>> not necessarily terminate cleanly. That results in logs that are at 
>> best truncated, at worst missing entirely. It also results in some 
>> frameworks aborting testing at that point. So no results are 
>> generated for all the other tests that have yet to be run. Some 
>> frameworks also run tests in batches. All they log is that something, 
>> somewhere in the batch died. So you don't even know which specific 
>> test actually hit the problem.
>>
>> Can the CI frameworks be improved? Undoubtedly. In very many ways. Is 
>> that something we have the ability to do with a simple patch? No. 
>> Would re-writing the IGT framework to add watchdog mechanisms improve 
>> things? Yes. Can it be done with a simple patch? No. Would a simple 
>> patch to i915 significantly improve the situation? Yes. Will it solve 
>> every possible CI hang? No. Will it fix any actual end user visible 
>> bugs? No. Will it introduce any new bugs? No. Will it help us to 
>> debug at least some CI failures? Yes.
>
> To unblock, I suggest you go with the patch which caps the wait only, 
> and propose a wedging as an IGT patch to gem_quiescent_gpu(). That 
> should involve the CI/IGT folks into discussion on what logs will be, 
> or will not be collected once gem_quiescent_gpu() fails due -ETIME. In 
> fact probably you should copy CI/IGT folks on the v2 of the i915 patch 
> as well since I now think their acks would be good to have - from the 
> point of view of the current test runner behaviour with hanging tests.
>
Simply returning -ETIME without wedging will actually make the situation 
worse. At the moment, you get 'all testing stopped due to machine not 
responding' bugs being logged. Which is a right pain and has very little 
useful information, but at least is not claiming random tests are broken 
when they are not. If you return ETIME without wedging then test A will 
hang and return ETIME. CI will log an ETIME bug against test A. CI will 
then try test B, which will fail with ETIME because the system is still 
broken but claiming to be working. So log a new bug against test B. Move 
on to test C, oh look, ETIME - log another bug and move on to test D... 
That is far worse, a whole slew of pointless and incorrect bugs have 
just been logged.

And how is it possibly considered a backwards breaking or dangerous 
change to wedge instead of hanging forever? Reboot versus wedge. 
Absolutely no defined behaviour at all because the system has simply 
stopped versus marking the system as broken and having a best effort at 
handling the situation. Yup, that's definitely a very dangerous change 
that could break all sorts of random user applications.

Re 'IGT folks' - whom? Ashutosh had already agreed to the original patch.

And CI folks are certainly aware of such issues. There are any number of 
comments in Jiras about 'no logs available, cannot analyse'.

John.


> Regards,
>
> Tvrtko
Tvrtko Ursulin Nov. 9, 2022, 11:35 a.m. UTC | #16
On 08/11/2022 19:37, John Harrison wrote:
> On 11/8/2022 01:08, Tvrtko Ursulin wrote:
>> On 07/11/2022 19:45, John Harrison wrote:
>>> On 11/7/2022 06:09, Tvrtko Ursulin wrote:
>>>> On 04/11/2022 17:45, John Harrison wrote:
>>>>> On 11/4/2022 03:01, Tvrtko Ursulin wrote:
>>>>>> On 03/11/2022 19:16, John Harrison wrote:
>>>>>>> On 11/3/2022 02:38, Tvrtko Ursulin wrote:
>>>>>>>> On 03/11/2022 09:18, Tvrtko Ursulin wrote:
>>>>>>>>> On 03/11/2022 01:33, John Harrison wrote:
>>>>>>>>>> On 11/2/2022 07:20, Tvrtko Ursulin wrote:
>>>>>>>>>>> On 02/11/2022 12:12, Jani Nikula wrote:
>>>>>>>>>>>> On Tue, 01 Nov 2022, John.C.Harrison@Intel.com wrote:
>>>>>>>>>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>>>>>>>>>
>>>>>>>>>>>>> At the end of each test, IGT does a drop caches call via 
>>>>>>>>>>>>> sysfs with
>>>>>>>>>>>>
>>>>>>>>>>>> sysfs?
>>>>>>>>>> Sorry, that was meant to say debugfs. I've also been working 
>>>>>>>>>> on some sysfs IGT issues and evidently got my wires crossed!
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> special flags set. One of the possible paths waits for idle 
>>>>>>>>>>>>> with an
>>>>>>>>>>>>> infinite timeout. That causes problems for debugging issues 
>>>>>>>>>>>>> when CI
>>>>>>>>>>>>> catches a "can't go idle" test failure. Best case, the CI 
>>>>>>>>>>>>> system times
>>>>>>>>>>>>> out (after 90s), attempts a bunch of state dump actions and 
>>>>>>>>>>>>> then
>>>>>>>>>>>>> reboots the system to recover it. Worst case, the CI system 
>>>>>>>>>>>>> can't do
>>>>>>>>>>>>> anything at all and then times out (after 1000s) and simply 
>>>>>>>>>>>>> reboots.
>>>>>>>>>>>>> Sometimes a serial port log of dmesg might be available, 
>>>>>>>>>>>>> sometimes not.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So rather than making life hard for ourselves, change the 
>>>>>>>>>>>>> timeout to
>>>>>>>>>>>>> be 10s rather than infinite. Also, trigger the standard
>>>>>>>>>>>>> wedge/reset/recover sequence so that testing can continue 
>>>>>>>>>>>>> with a
>>>>>>>>>>>>> working system (if possible).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>   drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>>>>>>>>>>>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
>>>>>>>>>>>>> b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>>>>>> index ae987e92251dd..9d916fbbfc27c 100644
>>>>>>>>>>>>> --- a/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>>>>>> @@ -641,6 +641,9 @@ 
>>>>>>>>>>>>> DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>>>>>>>>>>>>>             DROP_RESET_ACTIVE | \
>>>>>>>>>>>>>             DROP_RESET_SEQNO | \
>>>>>>>>>>>>>             DROP_RCU)
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +#define DROP_IDLE_TIMEOUT    (HZ * 10)
>>>>>>>>>>>>
>>>>>>>>>>>> I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's 
>>>>>>>>>>>> also only used
>>>>>>>>>>>> here.
>>>>>>>>>>>
>>>>>>>>>>> So move here, dropping i915 prefix, next to the newly 
>>>>>>>>>>> proposed one?
>>>>>>>>>> Sure, can do that.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only 
>>>>>>>>>>>> used in
>>>>>>>>>>>> gt/intel_gt.c.
>>>>>>>>>>>
>>>>>>>>>>> Move there and rename to GT_IDLE_TIMEOUT?
>>>>>>>>>>>
>>>>>>>>>>>> I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in 
>>>>>>>>>>>> intel_gt_pm.c.
>>>>>>>>>>>
>>>>>>>>>>> No action needed, maybe drop i915 prefix if wanted.
>>>>>>>>>>>
>>>>>>>>>> These two are totally unrelated and in code not being touched 
>>>>>>>>>> by this change. I would rather not conflate changing random 
>>>>>>>>>> other things with fixing this specific issue.
>>>>>>>>>>
>>>>>>>>>>>> I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.
>>>>>>>>>>>
>>>>>>>>>>> Add _MS suffix if wanted.
>>>>>>>>>>>
>>>>>>>>>>>> My head spins.
>>>>>>>>>>>
>>>>>>>>>>> I follow and raise that the newly proposed DROP_IDLE_TIMEOUT 
>>>>>>>>>>> applies to DROP_ACTIVE and not only DROP_IDLE.
>>>>>>>>>> My original intention for the name was that is the 'drop 
>>>>>>>>>> caches timeout for intel_gt_wait_for_idle'. Which is quite the 
>>>>>>>>>> mouthful and hence abbreviated to DROP_IDLE_TIMEOUT. But yes, 
>>>>>>>>>> I realised later that name can be conflated with the DROP_IDLE 
>>>>>>>>>> flag. Will rename.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Things get refactored, code moves around, bits get left 
>>>>>>>>>>> behind, who knows. No reason to get too worked up. :) As long 
>>>>>>>>>>> as people are taking a wider view when touching the code 
>>>>>>>>>>> base, and are not afraid to send cleanups, things should be 
>>>>>>>>>>> good.
>>>>>>>>>> On the other hand, if every patch gets blocked in code review 
>>>>>>>>>> because someone points out some completely unrelated piece of 
>>>>>>>>>> code could be a bit better then nothing ever gets fixed. If 
>>>>>>>>>> you spot something that you think should be improved, isn't 
>>>>>>>>>> the general idea that you should post a patch yourself to 
>>>>>>>>>> improve it?
>>>>>>>>>
>>>>>>>>> There's two maintainers per branch and an order of magnitude or 
>>>>>>>>> two more developers so it'd be nice if cleanups would just be 
>>>>>>>>> incoming on self-initiative basis. ;)
>>>>>>>>>
>>>>>>>>>>> For the actual functional change at hand - it would be nice 
>>>>>>>>>>> if code paths in question could handle SIGINT and then we 
>>>>>>>>>>> could punt the decision on how long someone wants to wait 
>>>>>>>>>>> purely to userspace. But it's probably hard and it's only 
>>>>>>>>>>> debugfs so whatever.
>>>>>>>>>>>
>>>>>>>>>> The code paths in question will already abort on a signal 
>>>>>>>>>> won't they? Both intel_gt_wait_for_idle() and 
>>>>>>>>>> intel_guc_wait_for_pending_msg(), which is where the 
>>>>>>>>>> uc_wait_for_idle eventually ends up, have an 
>>>>>>>>>> 'if(signal_pending) return -EINTR;' check. Beyond that, it 
>>>>>>>>>> sounds like what you are asking for is a change in the IGT 
>>>>>>>>>> libraries and/or CI framework to start sending signals after 
>>>>>>>>>> some specific timeout. That seems like a significantly more 
>>>>>>>>>> complex change (in terms of the number of entities affected 
>>>>>>>>>> and number of groups involved) and unnecessary.
>>>>>>>>>
>>>>>>>>> If you say so, I haven't looked at them all. But if the code 
>>>>>>>>> path in question already aborts on signals then I am not sure 
>>>>>>>>> what is the patch fixing? I assumed you are trying to avoid the 
>>>>>>>>> write stuck in D forever, which then prevents driver unload and 
>>>>>>>>> everything, requiring the test runner to eventually reboot. If 
>>>>>>>>> you say SIGINT works then you can already recover from 
>>>>>>>>> userspace, no?
>>>>>>>>>
>>>>>>>>>>> Whether or not 10s is enough CI will hopefully tell us. I'd 
>>>>>>>>>>> probably err on the side of safety and make it longer, but at 
>>>>>>>>>>> most half from the test runner timeout.
>>>>>>>>>> This is supposed to be test clean up. This is not about how 
>>>>>>>>>> long a particular test takes to complete but about how long it 
>>>>>>>>>> takes to declare the system broken after the test has already 
>>>>>>>>>> finished. I would argue that even 10s is massively longer than 
>>>>>>>>>> required.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I am not convinced that wedging is correct though. 
>>>>>>>>>>> Conceptually could be just that the timeout is too short. 
>>>>>>>>>>> What does wedging really give us, on top of limiting the 
>>>>>>>>>>> wait, when latter AFAIU is the key factor which would prevent 
>>>>>>>>>>> the need to reboot the machine?
>>>>>>>>>>>
>>>>>>>>>> It gives us a system that knows what state it is in. If we 
>>>>>>>>>> can't idle the GT then something is very badly wrong. Wedging 
>>>>>>>>>> indicates that. It also ensure that a full GT reset will be 
>>>>>>>>>> attempted before the next test is run. Helping to prevent a 
>>>>>>>>>> failure on test X from propagating into failures of unrelated 
>>>>>>>>>> tests X+1, X+2, ... And if the GT reset does not work because 
>>>>>>>>>> the system is really that badly broken then future tests will 
>>>>>>>>>> not run rather than report erroneous failures.
>>>>>>>>>>
>>>>>>>>>> This is not about getting a more stable system for end users 
>>>>>>>>>> by sweeping issues under the carpet and pretending all is 
>>>>>>>>>> well. End users don't run IGTs or explicitly call dodgy 
>>>>>>>>>> debugfs entry points. The sole motivation here is to get more 
>>>>>>>>>> accurate results from CI. That is, correctly identifying which 
>>>>>>>>>> test has hit a problem, getting valid debug analysis for that 
>>>>>>>>>> test (logs and such) and allowing further testing to complete 
>>>>>>>>>> correctly in the case where the system can be recovered.
>>>>>>>>>
>>>>>>>>> I don't really oppose shortening of the timeout in principle, 
>>>>>>>>> just want a clear statement if this is something IGT / test 
>>>>>>>>> runner could already do or not. It can apply a timeout, it can 
>>>>>>>>> also send SIGINT, and it could even trigger a reset from 
>>>>>>>>> outside. Sure it is debugfs hacks so general "kernel should not 
>>>>>>>>> implement policy" need not be strictly followed, but lets have 
>>>>>>>>> it clear what are the options.
>>>>>>>>
>>>>>>>> One conceptual problem with applying this policy is that the 
>>>>>>>> code is:
>>>>>>>>
>>>>>>>>     if (val & (DROP_IDLE | DROP_ACTIVE)) {
>>>>>>>>         ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
>>>>>>>>         if (ret)
>>>>>>>>             return ret;
>>>>>>>>     }
>>>>>>>>
>>>>>>>>     if (val & DROP_IDLE) {
>>>>>>>>         ret = intel_gt_pm_wait_for_idle(gt);
>>>>>>>>         if (ret)
>>>>>>>>             return ret;
>>>>>>>>     }
>>>>>>>>
>>>>>>>> So if someone passes in DROP_IDLE and then why would only the 
>>>>>>>> first branch have a short timeout and wedge. Yeah some bug 
>>>>>>>> happens to be there at the moment, but put a bug in a different 
>>>>>>>> place and you hang on the second branch and then need another 
>>>>>>>> patch. Versus perhaps making it all respect SIGINT and handle 
>>>>>>>> from outside.
>>>>>>>>
>>>>>>> The pm_wait_for_idle is can only called after gt_wait_for_idle 
>>>>>>> has completed successfully. There is no route to skip the GT idle 
>>>>>>> or to do the PM idle even if the GT idle fails. So the chances of 
>>>>>>> the PM idle failing are greatly reduced. There would have to be 
>>>>>>> something outside of a GT keeping the GPU awake and there isn't a 
>>>>>>> whole lot of hardware left at that point!
>>>>>>
>>>>>> Well "greatly reduced" is beside my point. Point is today bug is 
>>>>>> here and we add a timeout, tomorrow bug is there and then the same 
>>>>>> dance. It can be just a sw bug which forgets to release the pm ref 
>>>>>> in some circumstances, doesn't really matter.
>>>>>>
>>>>> Huh?
>>>>>
>>>>> Greatly reduced is the whole point. Today there is a bug and it 
>>>>> causes a kernel hang which requires the CI framework to reboot the 
>>>>> system in an extremely unfriendly way which makes it very hard to 
>>>>> work out what happened. Logs are likely not available. We don't 
>>>>> even necessarily know which test was being run at the time. Etc. So 
>>>>> we replace the infinite timeout with a meaningful timeout. CI now 
>>>>> correctly marks the single test as failing, captures all the 
>>>>> correct logs, creates a useful bug report and continues on testing 
>>>>> more stuff.
>>>>
>>>> So what is preventing CI to collect logs if IGT is forever stuck in 
>>>> interruptible wait? Surely it can collect the logs at that point if 
>>>> the kernel is healthy enough. If it isn't then I don't see how 
>>>> wedging the GPU will make the kernel any healthier.
>>>>
>>>> Is i915 preventing better log collection or could test runner be 
>>>> improved?
>>>>
>>>>> Sure, there is still the chance of hitting an infinite timeout. But 
>>>>> that one is significantly more complicated to remove. And the 
>>>>> chances of hitting that one are significantly smaller than the 
>>>>> chances of hitting the first one.
>>>>
>>>> This statement relies on intimate knowledge implementation details 
>>>> and a bit too much white box testing approach but that's okay, lets 
>>>> move past this one.
>>>>
>>>>> So you are arguing that because I can't fix the last 0.1% of 
>>>>> possible failures, I am not allowed to fix the first 99.9% of the 
>>>>> failures?
>>>>
>>>> I am clearly not arguing for that. But we are also not talking about 
>>>> "fixing failures" here. Just how to make CI cope better with a class 
>>>> of i915 bugs.
>>>>
>>>>>>> Regarding signals, the PM idle code ends up at 
>>>>>>> wait_var_event_killable(). I assume that is interruptible via at 
>>>>>>> least a KILL signal if not any signal. Although it's not entirely 
>>>>>>> clear trying to follow through the implementation of this code. 
>>>>>>> Also, I have no idea if there is a safe way to add a timeout to 
>>>>>>> that code (or why it wasn't already written with a timeout 
>>>>>>> included). Someone more familiar with the wakeref internals would 
>>>>>>> need to comment.
>>>>>>>
>>>>>>> However, I strongly disagree that we should not fix the driver 
>>>>>>> just because it is possible to workaround the issue by re-writing 
>>>>>>> the CI framework. Feel free to bring a redesign plan to the IGT 
>>>>>>> WG and whatever equivalent CI meetings in parallel. But we 
>>>>>>> absolutely should not have infinite waits in the kernel if there 
>>>>>>> is a trivial way to not have infinite waits.
>>>>>>
>>>>>> I thought I was clear that I am not really opposed to the timeout.
>>>>>>
>>>>>> The rest of the paragraph I don't really care - point is moot 
>>>>>> because it's debugfs so we can do whatever, as long as it is not 
>>>>>> burdensome to i915, which this isn't. If either wasn't the case 
>>>>>> then we certainly wouldn't be adding any workarounds in the kernel 
>>>>>> if it can be achieved in IGT.
>>>>>>
>>>>>>> Also, sending a signal does not result in the wedge happening. I 
>>>>>>> specifically did not want to change that code path because I was 
>>>>>>> assuming there was a valid reason for it. If you have been 
>>>>>>> interrupted then you are in the territory of maybe it would have 
>>>>>>> succeeded if you just left it for a moment longer. Whereas, 
>>>>>>> hitting the timeout says that someone very deliberately said this 
>>>>>>> is too long to wait and therefore the system must be broken.
>>>>>>
>>>>>> I wanted to know specifically about wedging - why can't you 
>>>>>> wedge/reset from IGT if DROP_IDLE times out in quiescent or 
>>>>>> wherever, if that's what you say is the right thing? 
>>>>> Huh?
>>>>>
>>>>> DROP_IDLE has two waits. One that I am trying to change from 
>>>>> infinite to finite + wedge. One that would take considerable effort 
>>>>> to change and would be quite invasive to a lot more of the driver 
>>>>> and which can only be hit if the first timeout actually completed 
>>>>> successfully and is therefore of less importance anyway. Both of 
>>>>> those time outs appear to respect signal interrupts.
>>>>>
>>>>>> That's a policy decision so why would i915 wedge if an arbitrary 
>>>>>> timeout expired? I915 is not controlling how much work there is 
>>>>>> outstanding at the point the IGT decides to call DROP_IDLE.
>>>>>
>>>>> Because this is a debug test interface that is used solely by IGT 
>>>>> after it has finished its testing. This is not about wedging the 
>>>>> device at some random arbitrary point because an AI compute 
>>>>> workload takes three hours to complete. This is about a very 
>>>>> specific test framework cleaning up after testing is completed and 
>>>>> making sure the test did not fry the system.
>>>>>
>>>>> And even if an IGT test was calling DROP_IDLE in the middle of a 
>>>>> test for some reason, it should not be deliberately pushing 10+ 
>>>>> seconds of work through and then calling a debug only interface to 
>>>>> flush it out. If a test wants to verify that the system can cope 
>>>>> with submitting a minutes worth of rendering and then waiting for 
>>>>> it to complete then the test should be using official channels for 
>>>>> that wait.
>>>>>
>>>>>>
>>>>>>> Plus, infinite wait is not a valid code path in the first place 
>>>>>>> so any change in behaviour is not really a change in behaviour. 
>>>>>>> Code can't be relying on a kernel call to never return for its 
>>>>>>> correct operation!
>>>>>>
>>>>>> Why infinite wait wouldn't be valid? Then you better change the 
>>>>>> other one as well. ;P
>>>>> In what universe is it ever valid to wait forever for a test to 
>>>>> complete?
>>>>
>>>> Well above you claimed both paths respect SIGINT. If that is so then 
>>>> the wait is as infinite as the IGT wanted it to be.
>>>>
>>>>> See above, the PM code would require much more invasive changes. 
>>>>> This was low hanging fruit. It was supposed to be a two minute 
>>>>> change to a very self contained section of code that would provide 
>>>>> significant benefit to debugging a small class of very hard to 
>>>>> debug problems.
>>>>
>>>> Sure, but I'd still like to know why can't you do what you want from 
>>>> the IGT framework.
>>>>
>>>> Have the timeout reduction in i915, again that's fine assuming 10 
>>>> seconds it enough to not break something by accident.
>>> CI showed no regressions. And if someone does find a valid reason why 
>>> a post test drop caches call should legitimately take a stupidly long 
>>> time then it is easy to track back where the ETIME error came from 
>>> and bump the timeout.
>>>
>>>>
>>>> With that change you already have broken the "infinite wait". It 
>>>> makes the debugfs write return -ETIME in time much shorter than the 
>>>> test runner timeout(s). What is the thing that you cannot do from 
>>>> IGT at that point is my question? You want to wedge then? Send 
>>>> DROP_RESET_ACTIVE to do it for you? If that doesn't work add a new 
>>>> flag which will wedge explicitly.
>>>>
>>>> We are again degrading into a huge philosophical discussion and all 
>>>> I wanted to start with is to hear how exactly things go bad.
>>>>
>>> I have no idea what you are wanting. I am trying to have a technical 
>>> discussion about improving the stability of the driver during CI 
>>> testing. I have no idea if you are arguing that this change is good, 
>>> bad, broken, wrong direction or what.
>>>
>>> Things go bad as explained in the commit message. The CI framework 
>>> does not use signals. The IGT framework does not use signals. There 
>>> is no watchdog that sends a TERM or KILL signal after a specified 
>>> timeout. All that happens is the IGT sits there forever waiting for 
>>> the drop caches IOCTL to return. The CI framework eventually gives up 
>>> waiting for the test to complete and tries to recover. There are many 
>>> different CI frameworks in use across Intel. Some timeout quickly, 
>>> some timeout slowly. But basically, they all eventually give up and 
>>> don't bother trying any kind of remedial action but just hit the 
>>> reset button (sometimes by literally power cycling the DUT). As 
>>> result, background processes that are saving dmesg, stdout, etc do 
>>> not necessarily terminate cleanly. That results in logs that are at 
>>> best truncated, at worst missing entirely. It also results in some 
>>> frameworks aborting testing at that point. So no results are 
>>> generated for all the other tests that have yet to be run. Some 
>>> frameworks also run tests in batches. All they log is that something, 
>>> somewhere in the batch died. So you don't even know which specific 
>>> test actually hit the problem.
>>>
>>> Can the CI frameworks be improved? Undoubtedly. In very many ways. Is 
>>> that something we have the ability to do with a simple patch? No. 
>>> Would re-writing the IGT framework to add watchdog mechanisms improve 
>>> things? Yes. Can it be done with a simple patch? No. Would a simple 
>>> patch to i915 significantly improve the situation? Yes. Will it solve 
>>> every possible CI hang? No. Will it fix any actual end user visible 
>>> bugs? No. Will it introduce any new bugs? No. Will it help us to 
>>> debug at least some CI failures? Yes.
>>
>> To unblock, I suggest you go with the patch which caps the wait only, 
>> and propose a wedging as an IGT patch to gem_quiescent_gpu(). That 
>> should involve the CI/IGT folks into discussion on what logs will be, 
>> or will not be collected once gem_quiescent_gpu() fails due -ETIME. In 
>> fact probably you should copy CI/IGT folks on the v2 of the i915 patch 
>> as well since I now think their acks would be good to have - from the 
>> point of view of the current test runner behaviour with hanging tests.
>>
> Simply returning -ETIME without wedging will actually make the situation 
> worse. At the moment, you get 'all testing stopped due to machine not 
> responding' bugs being logged. Which is a right pain and has very little 
> useful information, but at least is not claiming random tests are broken 
> when they are not. If you return ETIME without wedging then test A will 

Several times I asked why can't you wedge from gem_quiescent_gpu() since 
that is done on driver open. So the chain of failing tests describing 
below is not relevant to my question.

Whole point is why add policy to i915 if it can be done from userspace. 
Current API is called "wait for idle", not "wait for idle ten seconds 
max" (although fine since IGT will fail on timeout already), and not 
"wait for idle or wedge, sometimes".

Yes it's only debugfs, I said that multiple times already, so it could 
be whatever, but in principle adding code to kernel should always be the 
2nd option. Especially since the implementation is only a 50% kludge (I 
am referring to the 2nd DROP_IDLE branch where the proposal does not add 
a timeout or wedging). So a half-policy even. Wedge if this stage of 
DROP_IDLE timed out, but don't wedge if this other stage of DROP_IDLE 
timed out or failed.

Which is why I was saying, if signals would be respected anyway, why 
couldn't you do the whole thing in IGT to start with.. "wrap" 
gem_quiescent_gpu with alarm(2), send SIGINT, wedge, whatever. If that 
works it would be the same effect. And policy where it belongs, zero 
kernel code required. If it works.. I haven't checked, you said it would 
though so what would be wrong with this approach?

And completely separate from the above line of discussion I am not even 
sure how the "no logs" relates to this all. Sure some bugs result in no 
logs since kernel crashes so badly. This patch will not improve that.

And if the kernel is not badly broken test runner will detect a timeout 
and sure all further testing will be invalid. It's not the first, or 
last, or the only way that can happen. There will be logs though so it 
can be debugged and fixed. (Unless there can't be logs anyway.) So you 
find the first failing test and fix the issue. How often does it happen 
anyway.

Or if I totally got this wrong please paste some CI or CIbuglog links to 
put me on the correct path.

Regards,

Tvrtko

> hang and return ETIME. CI will log an ETIME bug against test A. CI will 
> then try test B, which will fail with ETIME because the system is still 
> broken but claiming to be working. So log a new bug against test B. Move 
> on to test C, oh look, ETIME - log another bug and move on to test D... 
> That is far worse, a whole slew of pointless and incorrect bugs have 
> just been logged.
> 
> And how is it possibly considered a backwards breaking or dangerous 
> change to wedge instead of hanging forever? Reboot versus wedge. 
> Absolutely no defined behaviour at all because the system has simply 
> stopped versus marking the system as broken and having a best effort at 
> handling the situation. Yup, that's definitely a very dangerous change 
> that could break all sorts of random user applications.
> 
> Re 'IGT folks' - whom? Ashutosh had already agreed to the original patch.
> 
> And CI folks are certainly aware of such issues. There are any number of 
> comments in Jiras about 'no logs available, cannot analyse'.
> 
> John.
> 
> 
>> Regards,
>>
>> Tvrtko
>
John Harrison Nov. 10, 2022, 6:20 a.m. UTC | #17
On 11/9/2022 03:35, Tvrtko Ursulin wrote:
> On 08/11/2022 19:37, John Harrison wrote:
>> On 11/8/2022 01:08, Tvrtko Ursulin wrote:
>>> On 07/11/2022 19:45, John Harrison wrote:
>>>> On 11/7/2022 06:09, Tvrtko Ursulin wrote:
>>>>> On 04/11/2022 17:45, John Harrison wrote:
>>>>>> On 11/4/2022 03:01, Tvrtko Ursulin wrote:
>>>>>>> On 03/11/2022 19:16, John Harrison wrote:
>>>>>>>> On 11/3/2022 02:38, Tvrtko Ursulin wrote:
>>>>>>>>> On 03/11/2022 09:18, Tvrtko Ursulin wrote:
>>>>>>>>>> On 03/11/2022 01:33, John Harrison wrote:
>>>>>>>>>>> On 11/2/2022 07:20, Tvrtko Ursulin wrote:
>>>>>>>>>>>> On 02/11/2022 12:12, Jani Nikula wrote:
>>>>>>>>>>>>> On Tue, 01 Nov 2022, John.C.Harrison@Intel.com wrote:
>>>>>>>>>>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> At the end of each test, IGT does a drop caches call via 
>>>>>>>>>>>>>> sysfs with
>>>>>>>>>>>>>
>>>>>>>>>>>>> sysfs?
>>>>>>>>>>> Sorry, that was meant to say debugfs. I've also been working 
>>>>>>>>>>> on some sysfs IGT issues and evidently got my wires crossed!
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> special flags set. One of the possible paths waits for 
>>>>>>>>>>>>>> idle with an
>>>>>>>>>>>>>> infinite timeout. That causes problems for debugging 
>>>>>>>>>>>>>> issues when CI
>>>>>>>>>>>>>> catches a "can't go idle" test failure. Best case, the CI 
>>>>>>>>>>>>>> system times
>>>>>>>>>>>>>> out (after 90s), attempts a bunch of state dump actions 
>>>>>>>>>>>>>> and then
>>>>>>>>>>>>>> reboots the system to recover it. Worst case, the CI 
>>>>>>>>>>>>>> system can't do
>>>>>>>>>>>>>> anything at all and then times out (after 1000s) and 
>>>>>>>>>>>>>> simply reboots.
>>>>>>>>>>>>>> Sometimes a serial port log of dmesg might be available, 
>>>>>>>>>>>>>> sometimes not.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So rather than making life hard for ourselves, change the 
>>>>>>>>>>>>>> timeout to
>>>>>>>>>>>>>> be 10s rather than infinite. Also, trigger the standard
>>>>>>>>>>>>>> wedge/reset/recover sequence so that testing can continue 
>>>>>>>>>>>>>> with a
>>>>>>>>>>>>>> working system (if possible).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>   drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
>>>>>>>>>>>>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
>>>>>>>>>>>>>> b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>>>>>>> index ae987e92251dd..9d916fbbfc27c 100644
>>>>>>>>>>>>>> --- a/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
>>>>>>>>>>>>>> @@ -641,6 +641,9 @@ 
>>>>>>>>>>>>>> DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
>>>>>>>>>>>>>>             DROP_RESET_ACTIVE | \
>>>>>>>>>>>>>>             DROP_RESET_SEQNO | \
>>>>>>>>>>>>>>             DROP_RCU)
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +#define DROP_IDLE_TIMEOUT    (HZ * 10)
>>>>>>>>>>>>>
>>>>>>>>>>>>> I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's 
>>>>>>>>>>>>> also only used
>>>>>>>>>>>>> here.
>>>>>>>>>>>>
>>>>>>>>>>>> So move here, dropping i915 prefix, next to the newly 
>>>>>>>>>>>> proposed one?
>>>>>>>>>>> Sure, can do that.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only 
>>>>>>>>>>>>> used in
>>>>>>>>>>>>> gt/intel_gt.c.
>>>>>>>>>>>>
>>>>>>>>>>>> Move there and rename to GT_IDLE_TIMEOUT?
>>>>>>>>>>>>
>>>>>>>>>>>>> I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in 
>>>>>>>>>>>>> intel_gt_pm.c.
>>>>>>>>>>>>
>>>>>>>>>>>> No action needed, maybe drop i915 prefix if wanted.
>>>>>>>>>>>>
>>>>>>>>>>> These two are totally unrelated and in code not being 
>>>>>>>>>>> touched by this change. I would rather not conflate changing 
>>>>>>>>>>> random other things with fixing this specific issue.
>>>>>>>>>>>
>>>>>>>>>>>>> I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.
>>>>>>>>>>>>
>>>>>>>>>>>> Add _MS suffix if wanted.
>>>>>>>>>>>>
>>>>>>>>>>>>> My head spins.
>>>>>>>>>>>>
>>>>>>>>>>>> I follow and raise that the newly proposed 
>>>>>>>>>>>> DROP_IDLE_TIMEOUT applies to DROP_ACTIVE and not only 
>>>>>>>>>>>> DROP_IDLE.
>>>>>>>>>>> My original intention for the name was that is the 'drop 
>>>>>>>>>>> caches timeout for intel_gt_wait_for_idle'. Which is quite 
>>>>>>>>>>> the mouthful and hence abbreviated to DROP_IDLE_TIMEOUT. But 
>>>>>>>>>>> yes, I realised later that name can be conflated with the 
>>>>>>>>>>> DROP_IDLE flag. Will rename.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Things get refactored, code moves around, bits get left 
>>>>>>>>>>>> behind, who knows. No reason to get too worked up. :) As 
>>>>>>>>>>>> long as people are taking a wider view when touching the 
>>>>>>>>>>>> code base, and are not afraid to send cleanups, things 
>>>>>>>>>>>> should be good.
>>>>>>>>>>> On the other hand, if every patch gets blocked in code 
>>>>>>>>>>> review because someone points out some completely unrelated 
>>>>>>>>>>> piece of code could be a bit better then nothing ever gets 
>>>>>>>>>>> fixed. If you spot something that you think should be 
>>>>>>>>>>> improved, isn't the general idea that you should post a 
>>>>>>>>>>> patch yourself to improve it?
>>>>>>>>>>
>>>>>>>>>> There's two maintainers per branch and an order of magnitude 
>>>>>>>>>> or two more developers so it'd be nice if cleanups would just 
>>>>>>>>>> be incoming on self-initiative basis. ;)
>>>>>>>>>>
>>>>>>>>>>>> For the actual functional change at hand - it would be nice 
>>>>>>>>>>>> if code paths in question could handle SIGINT and then we 
>>>>>>>>>>>> could punt the decision on how long someone wants to wait 
>>>>>>>>>>>> purely to userspace. But it's probably hard and it's only 
>>>>>>>>>>>> debugfs so whatever.
>>>>>>>>>>>>
>>>>>>>>>>> The code paths in question will already abort on a signal 
>>>>>>>>>>> won't they? Both intel_gt_wait_for_idle() and 
>>>>>>>>>>> intel_guc_wait_for_pending_msg(), which is where the 
>>>>>>>>>>> uc_wait_for_idle eventually ends up, have an 
>>>>>>>>>>> 'if(signal_pending) return -EINTR;' check. Beyond that, it 
>>>>>>>>>>> sounds like what you are asking for is a change in the IGT 
>>>>>>>>>>> libraries and/or CI framework to start sending signals after 
>>>>>>>>>>> some specific timeout. That seems like a significantly more 
>>>>>>>>>>> complex change (in terms of the number of entities affected 
>>>>>>>>>>> and number of groups involved) and unnecessary.
>>>>>>>>>>
>>>>>>>>>> If you say so, I haven't looked at them all. But if the code 
>>>>>>>>>> path in question already aborts on signals then I am not sure 
>>>>>>>>>> what is the patch fixing? I assumed you are trying to avoid 
>>>>>>>>>> the write stuck in D forever, which then prevents driver 
>>>>>>>>>> unload and everything, requiring the test runner to 
>>>>>>>>>> eventually reboot. If you say SIGINT works then you can 
>>>>>>>>>> already recover from userspace, no?
>>>>>>>>>>
>>>>>>>>>>>> Whether or not 10s is enough CI will hopefully tell us. I'd 
>>>>>>>>>>>> probably err on the side of safety and make it longer, but 
>>>>>>>>>>>> at most half from the test runner timeout.
>>>>>>>>>>> This is supposed to be test clean up. This is not about how 
>>>>>>>>>>> long a particular test takes to complete but about how long 
>>>>>>>>>>> it takes to declare the system broken after the test has 
>>>>>>>>>>> already finished. I would argue that even 10s is massively 
>>>>>>>>>>> longer than required.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I am not convinced that wedging is correct though. 
>>>>>>>>>>>> Conceptually could be just that the timeout is too short. 
>>>>>>>>>>>> What does wedging really give us, on top of limiting the 
>>>>>>>>>>>> wait, when latter AFAIU is the key factor which would 
>>>>>>>>>>>> prevent the need to reboot the machine?
>>>>>>>>>>>>
>>>>>>>>>>> It gives us a system that knows what state it is in. If we 
>>>>>>>>>>> can't idle the GT then something is very badly wrong. 
>>>>>>>>>>> Wedging indicates that. It also ensure that a full GT reset 
>>>>>>>>>>> will be attempted before the next test is run. Helping to 
>>>>>>>>>>> prevent a failure on test X from propagating into failures 
>>>>>>>>>>> of unrelated tests X+1, X+2, ... And if the GT reset does 
>>>>>>>>>>> not work because the system is really that badly broken then 
>>>>>>>>>>> future tests will not run rather than report erroneous 
>>>>>>>>>>> failures.
>>>>>>>>>>>
>>>>>>>>>>> This is not about getting a more stable system for end users 
>>>>>>>>>>> by sweeping issues under the carpet and pretending all is 
>>>>>>>>>>> well. End users don't run IGTs or explicitly call dodgy 
>>>>>>>>>>> debugfs entry points. The sole motivation here is to get 
>>>>>>>>>>> more accurate results from CI. That is, correctly 
>>>>>>>>>>> identifying which test has hit a problem, getting valid 
>>>>>>>>>>> debug analysis for that test (logs and such) and allowing 
>>>>>>>>>>> further testing to complete correctly in the case where the 
>>>>>>>>>>> system can be recovered.
>>>>>>>>>>
>>>>>>>>>> I don't really oppose shortening of the timeout in principle, 
>>>>>>>>>> just want a clear statement if this is something IGT / test 
>>>>>>>>>> runner could already do or not. It can apply a timeout, it 
>>>>>>>>>> can also send SIGINT, and it could even trigger a reset from 
>>>>>>>>>> outside. Sure it is debugfs hacks so general "kernel should 
>>>>>>>>>> not implement policy" need not be strictly followed, but lets 
>>>>>>>>>> have it clear what are the options.
>>>>>>>>>
>>>>>>>>> One conceptual problem with applying this policy is that the 
>>>>>>>>> code is:
>>>>>>>>>
>>>>>>>>>     if (val & (DROP_IDLE | DROP_ACTIVE)) {
>>>>>>>>>         ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
>>>>>>>>>         if (ret)
>>>>>>>>>             return ret;
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>>     if (val & DROP_IDLE) {
>>>>>>>>>         ret = intel_gt_pm_wait_for_idle(gt);
>>>>>>>>>         if (ret)
>>>>>>>>>             return ret;
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>> So if someone passes in DROP_IDLE and then why would only the 
>>>>>>>>> first branch have a short timeout and wedge. Yeah some bug 
>>>>>>>>> happens to be there at the moment, but put a bug in a 
>>>>>>>>> different place and you hang on the second branch and then 
>>>>>>>>> need another patch. Versus perhaps making it all respect 
>>>>>>>>> SIGINT and handle from outside.
>>>>>>>>>
>>>>>>>> The pm_wait_for_idle is can only called after gt_wait_for_idle 
>>>>>>>> has completed successfully. There is no route to skip the GT 
>>>>>>>> idle or to do the PM idle even if the GT idle fails. So the 
>>>>>>>> chances of the PM idle failing are greatly reduced. There would 
>>>>>>>> have to be something outside of a GT keeping the GPU awake and 
>>>>>>>> there isn't a whole lot of hardware left at that point!
>>>>>>>
>>>>>>> Well "greatly reduced" is beside my point. Point is today bug is 
>>>>>>> here and we add a timeout, tomorrow bug is there and then the 
>>>>>>> same dance. It can be just a sw bug which forgets to release the 
>>>>>>> pm ref in some circumstances, doesn't really matter.
>>>>>>>
>>>>>> Huh?
>>>>>>
>>>>>> Greatly reduced is the whole point. Today there is a bug and it 
>>>>>> causes a kernel hang which requires the CI framework to reboot 
>>>>>> the system in an extremely unfriendly way which makes it very 
>>>>>> hard to work out what happened. Logs are likely not available. We 
>>>>>> don't even necessarily know which test was being run at the time. 
>>>>>> Etc. So we replace the infinite timeout with a meaningful 
>>>>>> timeout. CI now correctly marks the single test as failing, 
>>>>>> captures all the correct logs, creates a useful bug report and 
>>>>>> continues on testing more stuff.
>>>>>
>>>>> So what is preventing CI to collect logs if IGT is forever stuck 
>>>>> in interruptible wait? Surely it can collect the logs at that 
>>>>> point if the kernel is healthy enough. If it isn't then I don't 
>>>>> see how wedging the GPU will make the kernel any healthier.
>>>>>
>>>>> Is i915 preventing better log collection or could test runner be 
>>>>> improved?
>>>>>
>>>>>> Sure, there is still the chance of hitting an infinite timeout. 
>>>>>> But that one is significantly more complicated to remove. And the 
>>>>>> chances of hitting that one are significantly smaller than the 
>>>>>> chances of hitting the first one.
>>>>>
>>>>> This statement relies on intimate knowledge implementation details 
>>>>> and a bit too much white box testing approach but that's okay, 
>>>>> lets move past this one.
>>>>>
>>>>>> So you are arguing that because I can't fix the last 0.1% of 
>>>>>> possible failures, I am not allowed to fix the first 99.9% of the 
>>>>>> failures?
>>>>>
>>>>> I am clearly not arguing for that. But we are also not talking 
>>>>> about "fixing failures" here. Just how to make CI cope better with 
>>>>> a class of i915 bugs.
>>>>>
>>>>>>>> Regarding signals, the PM idle code ends up at 
>>>>>>>> wait_var_event_killable(). I assume that is interruptible via 
>>>>>>>> at least a KILL signal if not any signal. Although it's not 
>>>>>>>> entirely clear trying to follow through the implementation of 
>>>>>>>> this code. Also, I have no idea if there is a safe way to add a 
>>>>>>>> timeout to that code (or why it wasn't already written with a 
>>>>>>>> timeout included). Someone more familiar with the wakeref 
>>>>>>>> internals would need to comment.
>>>>>>>>
>>>>>>>> However, I strongly disagree that we should not fix the driver 
>>>>>>>> just because it is possible to workaround the issue by 
>>>>>>>> re-writing the CI framework. Feel free to bring a redesign plan 
>>>>>>>> to the IGT WG and whatever equivalent CI meetings in parallel. 
>>>>>>>> But we absolutely should not have infinite waits in the kernel 
>>>>>>>> if there is a trivial way to not have infinite waits.
>>>>>>>
>>>>>>> I thought I was clear that I am not really opposed to the timeout.
>>>>>>>
>>>>>>> The rest of the paragraph I don't really care - point is moot 
>>>>>>> because it's debugfs so we can do whatever, as long as it is not 
>>>>>>> burdensome to i915, which this isn't. If either wasn't the case 
>>>>>>> then we certainly wouldn't be adding any workarounds in the 
>>>>>>> kernel if it can be achieved in IGT.
>>>>>>>
>>>>>>>> Also, sending a signal does not result in the wedge happening. 
>>>>>>>> I specifically did not want to change that code path because I 
>>>>>>>> was assuming there was a valid reason for it. If you have been 
>>>>>>>> interrupted then you are in the territory of maybe it would 
>>>>>>>> have succeeded if you just left it for a moment longer. 
>>>>>>>> Whereas, hitting the timeout says that someone very 
>>>>>>>> deliberately said this is too long to wait and therefore the 
>>>>>>>> system must be broken.
>>>>>>>
>>>>>>> I wanted to know specifically about wedging - why can't you 
>>>>>>> wedge/reset from IGT if DROP_IDLE times out in quiescent or 
>>>>>>> wherever, if that's what you say is the right thing? 
>>>>>> Huh?
>>>>>>
>>>>>> DROP_IDLE has two waits. One that I am trying to change from 
>>>>>> infinite to finite + wedge. One that would take considerable 
>>>>>> effort to change and would be quite invasive to a lot more of the 
>>>>>> driver and which can only be hit if the first timeout actually 
>>>>>> completed successfully and is therefore of less importance 
>>>>>> anyway. Both of those time outs appear to respect signal interrupts.
>>>>>>
>>>>>>> That's a policy decision so why would i915 wedge if an arbitrary 
>>>>>>> timeout expired? I915 is not controlling how much work there is 
>>>>>>> outstanding at the point the IGT decides to call DROP_IDLE.
>>>>>>
>>>>>> Because this is a debug test interface that is used solely by IGT 
>>>>>> after it has finished its testing. This is not about wedging the 
>>>>>> device at some random arbitrary point because an AI compute 
>>>>>> workload takes three hours to complete. This is about a very 
>>>>>> specific test framework cleaning up after testing is completed 
>>>>>> and making sure the test did not fry the system.
>>>>>>
>>>>>> And even if an IGT test was calling DROP_IDLE in the middle of a 
>>>>>> test for some reason, it should not be deliberately pushing 10+ 
>>>>>> seconds of work through and then calling a debug only interface 
>>>>>> to flush it out. If a test wants to verify that the system can 
>>>>>> cope with submitting a minutes worth of rendering and then 
>>>>>> waiting for it to complete then the test should be using official 
>>>>>> channels for that wait.
>>>>>>
>>>>>>>
>>>>>>>> Plus, infinite wait is not a valid code path in the first place 
>>>>>>>> so any change in behaviour is not really a change in behaviour. 
>>>>>>>> Code can't be relying on a kernel call to never return for its 
>>>>>>>> correct operation!
>>>>>>>
>>>>>>> Why infinite wait wouldn't be valid? Then you better change the 
>>>>>>> other one as well. ;P
>>>>>> In what universe is it ever valid to wait forever for a test to 
>>>>>> complete?
>>>>>
>>>>> Well above you claimed both paths respect SIGINT. If that is so 
>>>>> then the wait is as infinite as the IGT wanted it to be.
>>>>>
>>>>>> See above, the PM code would require much more invasive changes. 
>>>>>> This was low hanging fruit. It was supposed to be a two minute 
>>>>>> change to a very self contained section of code that would 
>>>>>> provide significant benefit to debugging a small class of very 
>>>>>> hard to debug problems.
>>>>>
>>>>> Sure, but I'd still like to know why can't you do what you want 
>>>>> from the IGT framework.
>>>>>
>>>>> Have the timeout reduction in i915, again that's fine assuming 10 
>>>>> seconds it enough to not break something by accident.
>>>> CI showed no regressions. And if someone does find a valid reason 
>>>> why a post test drop caches call should legitimately take a 
>>>> stupidly long time then it is easy to track back where the ETIME 
>>>> error came from and bump the timeout.
>>>>
>>>>>
>>>>> With that change you already have broken the "infinite wait". It 
>>>>> makes the debugfs write return -ETIME in time much shorter than 
>>>>> the test runner timeout(s). What is the thing that you cannot do 
>>>>> from IGT at that point is my question? You want to wedge then? 
>>>>> Send DROP_RESET_ACTIVE to do it for you? If that doesn't work add 
>>>>> a new flag which will wedge explicitly.
>>>>>
>>>>> We are again degrading into a huge philosophical discussion and 
>>>>> all I wanted to start with is to hear how exactly things go bad.
>>>>>
>>>> I have no idea what you are wanting. I am trying to have a 
>>>> technical discussion about improving the stability of the driver 
>>>> during CI testing. I have no idea if you are arguing that this 
>>>> change is good, bad, broken, wrong direction or what.
>>>>
>>>> Things go bad as explained in the commit message. The CI framework 
>>>> does not use signals. The IGT framework does not use signals. There 
>>>> is no watchdog that sends a TERM or KILL signal after a specified 
>>>> timeout. All that happens is the IGT sits there forever waiting for 
>>>> the drop caches IOCTL to return. The CI framework eventually gives 
>>>> up waiting for the test to complete and tries to recover. There are 
>>>> many different CI frameworks in use across Intel. Some timeout 
>>>> quickly, some timeout slowly. But basically, they all eventually 
>>>> give up and don't bother trying any kind of remedial action but 
>>>> just hit the reset button (sometimes by literally power cycling the 
>>>> DUT). As result, background processes that are saving dmesg, 
>>>> stdout, etc do not necessarily terminate cleanly. That results in 
>>>> logs that are at best truncated, at worst missing entirely. It also 
>>>> results in some frameworks aborting testing at that point. So no 
>>>> results are generated for all the other tests that have yet to be 
>>>> run. Some frameworks also run tests in batches. All they log is 
>>>> that something, somewhere in the batch died. So you don't even know 
>>>> which specific test actually hit the problem.
>>>>
>>>> Can the CI frameworks be improved? Undoubtedly. In very many ways. 
>>>> Is that something we have the ability to do with a simple patch? 
>>>> No. Would re-writing the IGT framework to add watchdog mechanisms 
>>>> improve things? Yes. Can it be done with a simple patch? No. Would 
>>>> a simple patch to i915 significantly improve the situation? Yes. 
>>>> Will it solve every possible CI hang? No. Will it fix any actual 
>>>> end user visible bugs? No. Will it introduce any new bugs? No. Will 
>>>> it help us to debug at least some CI failures? Yes.
>>>
>>> To unblock, I suggest you go with the patch which caps the wait 
>>> only, and propose a wedging as an IGT patch to gem_quiescent_gpu(). 
>>> That should involve the CI/IGT folks into discussion on what logs 
>>> will be, or will not be collected once gem_quiescent_gpu() fails due 
>>> -ETIME. In fact probably you should copy CI/IGT folks on the v2 of 
>>> the i915 patch as well since I now think their acks would be good to 
>>> have - from the point of view of the current test runner behaviour 
>>> with hanging tests.
>>>
>> Simply returning -ETIME without wedging will actually make the 
>> situation worse. At the moment, you get 'all testing stopped due to 
>> machine not responding' bugs being logged. Which is a right pain and 
>> has very little useful information, but at least is not claiming 
>> random tests are broken when they are not. If you return ETIME 
>> without wedging then test A will 
>
> Several times I asked why can't you wedge from gem_quiescent_gpu() 
> since that is done on driver open. So the chain of failing tests 
> describing below is not relevant to my question.
Actually, no. You have mentioned gem_quiescent_gpu() once and as an IGT 
patch. Which presumably means an entire new API between IGT and i915.

>
> Whole point is why add policy to i915 if it can be done from 
> userspace. Current API is called "wait for idle", not "wait for idle 
> ten seconds max" (although fine since IGT will fail on timeout 
> already), and not "wait for idle or wedge, sometimes".
>
> Yes it's only debugfs, I said that multiple times already, so it could 
> be whatever, but in principle adding code to kernel should always be 
> the 2nd option. Especially since the implementation is only a 50% 
> kludge (I am referring to the 2nd DROP_IDLE branch where the proposal 
> does not add a timeout or wedging). So a half-policy even. Wedge if 
> this stage of DROP_IDLE timed out, but don't wedge if this other stage 
> of DROP_IDLE timed out or failed.
>
> Which is why I was saying, if signals would be respected anyway, why 
> couldn't you do the whole thing in IGT to start with.. "wrap" 
> gem_quiescent_gpu with alarm(2), send SIGINT, wedge, whatever. If that 
> works it would be the same effect. And policy where it belongs, zero 
> kernel code required. If it works.. I haven't checked, you said it 
> would though so what would be wrong with this approach?
Finding someone to do it. If you are familiar with the IGT framework 
internals then feel free. I am not. Whereas, this was a trivial change 
that could improve the situation while having no bad side effects 
(because if the alternative is hanging forever then any change is a good 
change).

>
> And completely separate from the above line of discussion I am not 
> even sure how the "no logs" relates to this all. Sure some bugs result 
> in no logs since kernel crashes so badly. This patch will not improve 
> that.
I never said it would solve every 'missing log' situation. I said it 
would help with the situation where the CI framework times out because 
of one specific class of failures. And in that case it does currently 
reboot with little or no attempt at recovery and therefore little or no 
log capture.

>
> And if the kernel is not badly broken test runner will detect a 
> timeout and sure all further testing will be invalid. It's not the 
> first, or last, or the only way that can happen. There will be logs 
> though so it can be debugged and fixed. (Unless there can't be logs 
> anyway.) So you find the first failing test and fix the issue. How 
> often does it happen anyway.
>
> Or if I totally got this wrong please paste some CI or CIbuglog links 
> to put me on the correct path.
As stated, there are very many bug reports of 'test timed out, 
rebooted'. It is impossible to know exactly how each particular instance 
got into that situation. So no, there is no CI report where I can 
categorically say this is exactly what happened. However, while 
debugging one such issue, I spotted this particular route into that 
situation and realised that it was something that could be trivially fixed.

Except apparently I'm not allowed to. So I give up. I don't have time to 
pursue this any further.

John.

>
> Regards,
>
> Tvrtko
>
>> hang and return ETIME. CI will log an ETIME bug against test A. CI 
>> will then try test B, which will fail with ETIME because the system 
>> is still broken but claiming to be working. So log a new bug against 
>> test B. Move on to test C, oh look, ETIME - log another bug and move 
>> on to test D... That is far worse, a whole slew of pointless and 
>> incorrect bugs have just been logged.
>>
>> And how is it possibly considered a backwards breaking or dangerous 
>> change to wedge instead of hanging forever? Reboot versus wedge. 
>> Absolutely no defined behaviour at all because the system has simply 
>> stopped versus marking the system as broken and having a best effort 
>> at handling the situation. Yup, that's definitely a very dangerous 
>> change that could break all sorts of random user applications.
>>
>> Re 'IGT folks' - whom? Ashutosh had already agreed to the original 
>> patch.
>>
>> And CI folks are certainly aware of such issues. There are any number 
>> of comments in Jiras about 'no logs available, cannot analyse'.
>>
>> John.
>>
>>
>>> Regards,
>>>
>>> Tvrtko
>>
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index ae987e92251dd..9d916fbbfc27c 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -641,6 +641,9 @@  DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
 		  DROP_RESET_ACTIVE | \
 		  DROP_RESET_SEQNO | \
 		  DROP_RCU)
+
+#define DROP_IDLE_TIMEOUT	(HZ * 10)
+
 static int
 i915_drop_caches_get(void *data, u64 *val)
 {
@@ -661,7 +664,9 @@  gt_drop_caches(struct intel_gt *gt, u64 val)
 		intel_gt_retire_requests(gt);
 
 	if (val & (DROP_IDLE | DROP_ACTIVE)) {
-		ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
+		ret = intel_gt_wait_for_idle(gt, DROP_IDLE_TIMEOUT);
+		if (ret == -ETIME)
+			intel_gt_set_wedged(gt);
 		if (ret)
 			return ret;
 	}