diff mbox series

[i-g-t,v2] runner: Don't kill a test on taint if watching timeouts

Message ID 20201204195007.10215-1-janusz.krzysztofik@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series [i-g-t,v2] runner: Don't kill a test on taint if watching timeouts | expand

Commit Message

Janusz Krzysztofik Dec. 4, 2020, 7:50 p.m. UTC
We may still be interested in results of a test even if it has tainted
the kernel.  On the other hand, we need to kill the test on taint if no
other means of killing it on a jam is active.

If abort on both kernel taint or a timeout is requested, decrease all
potential timeouts significantly while the taint is detected instead of
aborting immediately.  However, report the taint as the reason of the
abort if a timeout decreased by the taint expires.

v2: Fix missing show_kernel_task_state() lost on rebase conflict
    resolution (Chris - thanks!)

Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
---
 runner/executor.c | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)

Comments

Petri Latvala Dec. 7, 2020, 1:09 p.m. UTC | #1
On Fri, Dec 04, 2020 at 08:50:07PM +0100, Janusz Krzysztofik wrote:
> We may still be interested in results of a test even if it has tainted
> the kernel.  On the other hand, we need to kill the test on taint if no
> other means of killing it on a jam is active.
> 
> If abort on both kernel taint or a timeout is requested, decrease all
> potential timeouts significantly while the taint is detected instead of
> aborting immediately.  However, report the taint as the reason of the
> abort if a timeout decreased by the taint expires.
> 
> v2: Fix missing show_kernel_task_state() lost on rebase conflict
>     resolution (Chris - thanks!)
> 
> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>


The effects of this is that we sometimes now get more logs from a test
at the cost of it not directly showing up as an incomplete. We would
still get the igt@runner@aborted result for it so overall we still
catch tainting cases.

Chris's comments have been clarified off-list not to mean directly
opposing this patch, so


Reviewed-by: Petri Latvala <petri.latvala@intel.com>



> ---
>  runner/executor.c | 26 ++++++++++++++++++++------
>  1 file changed, 20 insertions(+), 6 deletions(-)
> 
> diff --git a/runner/executor.c b/runner/executor.c
> index 1688ae41d..faf272d85 100644
> --- a/runner/executor.c
> +++ b/runner/executor.c
> @@ -726,6 +726,8 @@ static const char *need_to_timeout(struct settings *settings,
>  				   double time_since_kill,
>  				   size_t disk_usage)
>  {
> +	int decrease = 1;
> +
>  	if (killed) {
>  		/*
>  		 * Timeout after being killed is a hardcoded amount
> @@ -753,20 +755,32 @@ static const char *need_to_timeout(struct settings *settings,
>  	}
>  
>  	/*
> -	 * If we're configured to care about taints, kill the
> -	 * test if there's a taint.
> +	 * If we're configured to care about taints,
> +	 * decrease timeouts in use if there's a taint,
> +	 * or kill the test if no timeouts have been requested.
>  	 */
>  	if (settings->abort_mask & ABORT_TAINT &&
> -	    is_tainted(taints))
> -		return "Killing the test because the kernel is tainted.\n";
> +	    is_tainted(taints)) {
> +		/* list of timeouts that may postpone immediate kill on taint */
> +		if (settings->per_test_timeout || settings->inactivity_timeout)
> +			decrease = 10;
> +		else
> +			return "Killing the test because the kernel is tainted.\n";
> +	}
>  
>  	if (settings->per_test_timeout != 0 &&
> -	    time_since_subtest > settings->per_test_timeout)
> +	    time_since_subtest > settings->per_test_timeout / decrease) {
> +		if (decrease > 1)
> +			return "Killing the test because the kernel is tainted.\n";
>  		return show_kernel_task_state("Per-test timeout exceeded. Killing the current test with SIGQUIT.\n");
> +	}
>  
>  	if (settings->inactivity_timeout != 0 &&
> -	    time_since_activity > settings->inactivity_timeout)
> +	    time_since_activity > settings->inactivity_timeout / decrease ) {
> +		if (decrease > 1)
> +			return "Killing the test because the kernel is tainted.\n";
>  		return show_kernel_task_state("Inactivity timeout exceeded. Killing the current test with SIGQUIT.\n");
> +	}
>  
>  	if (disk_usage_limit_exceeded(settings, disk_usage))
>  		return "Disk usage limit exceeded.\n";
> -- 
> 2.21.1
>
Janusz Krzysztofik Dec. 7, 2020, 1:18 p.m. UTC | #2
On Mon, 2020-12-07 at 15:09 +0200, Petri Latvala wrote:
> On Fri, Dec 04, 2020 at 08:50:07PM +0100, Janusz Krzysztofik wrote:
> > We may still be interested in results of a test even if it has tainted
> > the kernel.  On the other hand, we need to kill the test on taint if no
> > other means of killing it on a jam is active.
> > 
> > If abort on both kernel taint or a timeout is requested, decrease all
> > potential timeouts significantly while the taint is detected instead of
> > aborting immediately.  However, report the taint as the reason of the
> > abort if a timeout decreased by the taint expires.
> > 
> > v2: Fix missing show_kernel_task_state() lost on rebase conflict
> >     resolution (Chris - thanks!)
> > 
> > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
> 
> The effects of this is that we sometimes now get more logs from a test
> at the cost of it not directly showing up as an incomplete. We would
> still get the igt@runner@aborted result for it so overall we still
> catch tainting cases.
> 
> Chris's comments have been clarified off-list not to mean directly
> opposing this patch, so
> 
> 
> Reviewed-by: Petri Latvala <petri.latvala@intel.com>

Thanks, pushed.

Janusz

> 
> 
> 
> > ---
> >  runner/executor.c | 26 ++++++++++++++++++++------
> >  1 file changed, 20 insertions(+), 6 deletions(-)
> > 
> > diff --git a/runner/executor.c b/runner/executor.c
> > index 1688ae41d..faf272d85 100644
> > --- a/runner/executor.c
> > +++ b/runner/executor.c
> > @@ -726,6 +726,8 @@ static const char *need_to_timeout(struct settings *settings,
> >  				   double time_since_kill,
> >  				   size_t disk_usage)
> >  {
> > +	int decrease = 1;
> > +
> >  	if (killed) {
> >  		/*
> >  		 * Timeout after being killed is a hardcoded amount
> > @@ -753,20 +755,32 @@ static const char *need_to_timeout(struct settings *settings,
> >  	}
> >  
> >  	/*
> > -	 * If we're configured to care about taints, kill the
> > -	 * test if there's a taint.
> > +	 * If we're configured to care about taints,
> > +	 * decrease timeouts in use if there's a taint,
> > +	 * or kill the test if no timeouts have been requested.
> >  	 */
> >  	if (settings->abort_mask & ABORT_TAINT &&
> > -	    is_tainted(taints))
> > -		return "Killing the test because the kernel is tainted.\n";
> > +	    is_tainted(taints)) {
> > +		/* list of timeouts that may postpone immediate kill on taint */
> > +		if (settings->per_test_timeout || settings->inactivity_timeout)
> > +			decrease = 10;
> > +		else
> > +			return "Killing the test because the kernel is tainted.\n";
> > +	}
> >  
> >  	if (settings->per_test_timeout != 0 &&
> > -	    time_since_subtest > settings->per_test_timeout)
> > +	    time_since_subtest > settings->per_test_timeout / decrease) {
> > +		if (decrease > 1)
> > +			return "Killing the test because the kernel is tainted.\n";
> >  		return show_kernel_task_state("Per-test timeout exceeded. Killing the current test with SIGQUIT.\n");
> > +	}
> >  
> >  	if (settings->inactivity_timeout != 0 &&
> > -	    time_since_activity > settings->inactivity_timeout)
> > +	    time_since_activity > settings->inactivity_timeout / decrease ) {
> > +		if (decrease > 1)
> > +			return "Killing the test because the kernel is tainted.\n";
> >  		return show_kernel_task_state("Inactivity timeout exceeded. Killing the current test with SIGQUIT.\n");
> > +	}
> >  
> >  	if (disk_usage_limit_exceeded(settings, disk_usage))
> >  		return "Disk usage limit exceeded.\n";
> > -- 
> > 2.21.1
> >
Chris Wilson Jan. 6, 2021, 9:41 a.m. UTC | #3
Quoting Janusz Krzysztofik (2020-12-04 19:50:07)
> We may still be interested in results of a test even if it has tainted
> the kernel.  On the other hand, we need to kill the test on taint if no
> other means of killing it on a jam is active.
> 
> If abort on both kernel taint or a timeout is requested, decrease all
> potential timeouts significantly while the taint is detected instead of
> aborting immediately.  However, report the taint as the reason of the
> abort if a timeout decreased by the taint expires.

This has the nasty side effect of not stopping the test run after a
kernel taint. Instead the next test inherits the tainted condition from
the previous test and usually ends up being declared incomplete.

False positives are frustrating.
-Chris
Petri Latvala Jan. 7, 2021, 9:40 a.m. UTC | #4
On Wed, Jan 06, 2021 at 09:41:37AM +0000, Chris Wilson wrote:
> Quoting Janusz Krzysztofik (2020-12-04 19:50:07)
> > We may still be interested in results of a test even if it has tainted
> > the kernel.  On the other hand, we need to kill the test on taint if no
> > other means of killing it on a jam is active.
> > 
> > If abort on both kernel taint or a timeout is requested, decrease all
> > potential timeouts significantly while the taint is detected instead of
> > aborting immediately.  However, report the taint as the reason of the
> > abort if a timeout decreased by the taint expires.
> 
> This has the nasty side effect of not stopping the test run after a
> kernel taint. Instead the next test inherits the tainted condition from
> the previous test and usually ends up being declared incomplete.
> 
> False positives are frustrating.
> -Chris


Do you have a link to a test run where this happened? This patch
didn't change the between-tests abort checks.
Chris Wilson Jan. 7, 2021, 9:49 a.m. UTC | #5
Quoting Petri Latvala (2021-01-07 09:40:02)
> On Wed, Jan 06, 2021 at 09:41:37AM +0000, Chris Wilson wrote:
> > Quoting Janusz Krzysztofik (2020-12-04 19:50:07)
> > > We may still be interested in results of a test even if it has tainted
> > > the kernel.  On the other hand, we need to kill the test on taint if no
> > > other means of killing it on a jam is active.
> > > 
> > > If abort on both kernel taint or a timeout is requested, decrease all
> > > potential timeouts significantly while the taint is detected instead of
> > > aborting immediately.  However, report the taint as the reason of the
> > > abort if a timeout decreased by the taint expires.
> > 
> > This has the nasty side effect of not stopping the test run after a
> > kernel taint. Instead the next test inherits the tainted condition from
> > the previous test and usually ends up being declared incomplete.
> > 
> > False positives are frustrating.
> > -Chris
> 
> 
> Do you have a link to a test run where this happened? This patch
> didn't change the between-tests abort checks.

The taint is from the warnings in the penultimate test:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9550/shard-skl7/igt_runner20.txt
-Chris
Petri Latvala Jan. 7, 2021, 10:06 a.m. UTC | #6
On Thu, Jan 07, 2021 at 09:49:23AM +0000, Chris Wilson wrote:
> Quoting Petri Latvala (2021-01-07 09:40:02)
> > On Wed, Jan 06, 2021 at 09:41:37AM +0000, Chris Wilson wrote:
> > > Quoting Janusz Krzysztofik (2020-12-04 19:50:07)
> > > > We may still be interested in results of a test even if it has tainted
> > > > the kernel.  On the other hand, we need to kill the test on taint if no
> > > > other means of killing it on a jam is active.
> > > > 
> > > > If abort on both kernel taint or a timeout is requested, decrease all
> > > > potential timeouts significantly while the taint is detected instead of
> > > > aborting immediately.  However, report the taint as the reason of the
> > > > abort if a timeout decreased by the taint expires.
> > > 
> > > This has the nasty side effect of not stopping the test run after a
> > > kernel taint. Instead the next test inherits the tainted condition from
> > > the previous test and usually ends up being declared incomplete.
> > > 
> > > False positives are frustrating.
> > > -Chris
> > 
> > 
> > Do you have a link to a test run where this happened? This patch
> > didn't change the between-tests abort checks.
> 
> The taint is from the warnings in the penultimate test:
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9550/shard-skl7/igt_runner20.txt

Ah, I see.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9550/shard-skl7/dmesg20.txt

This is the tainting WARN I presume:

<4>[  917.575173] Memory manager not clean during takedown.
<4>[  917.575272] WARNING: CPU: 2 PID: 7 at drivers/gpu/drm/drm_mm.c:999 drm_mm_takedown+0x51/0x100

That happens after @gem, before @evict.

In other words, this is all in the same exec() of i915_selftest
--run-subtest live. Incorrect _dynamic_ subtest gets blamed.

Getting the killing right here is a bit tricky, possibly doable. Or
rather, getting the blame right is doable, killing will be inherently
racy...
diff mbox series

Patch

diff --git a/runner/executor.c b/runner/executor.c
index 1688ae41d..faf272d85 100644
--- a/runner/executor.c
+++ b/runner/executor.c
@@ -726,6 +726,8 @@  static const char *need_to_timeout(struct settings *settings,
 				   double time_since_kill,
 				   size_t disk_usage)
 {
+	int decrease = 1;
+
 	if (killed) {
 		/*
 		 * Timeout after being killed is a hardcoded amount
@@ -753,20 +755,32 @@  static const char *need_to_timeout(struct settings *settings,
 	}
 
 	/*
-	 * If we're configured to care about taints, kill the
-	 * test if there's a taint.
+	 * If we're configured to care about taints,
+	 * decrease timeouts in use if there's a taint,
+	 * or kill the test if no timeouts have been requested.
 	 */
 	if (settings->abort_mask & ABORT_TAINT &&
-	    is_tainted(taints))
-		return "Killing the test because the kernel is tainted.\n";
+	    is_tainted(taints)) {
+		/* list of timeouts that may postpone immediate kill on taint */
+		if (settings->per_test_timeout || settings->inactivity_timeout)
+			decrease = 10;
+		else
+			return "Killing the test because the kernel is tainted.\n";
+	}
 
 	if (settings->per_test_timeout != 0 &&
-	    time_since_subtest > settings->per_test_timeout)
+	    time_since_subtest > settings->per_test_timeout / decrease) {
+		if (decrease > 1)
+			return "Killing the test because the kernel is tainted.\n";
 		return show_kernel_task_state("Per-test timeout exceeded. Killing the current test with SIGQUIT.\n");
+	}
 
 	if (settings->inactivity_timeout != 0 &&
-	    time_since_activity > settings->inactivity_timeout)
+	    time_since_activity > settings->inactivity_timeout / decrease ) {
+		if (decrease > 1)
+			return "Killing the test because the kernel is tainted.\n";
 		return show_kernel_task_state("Inactivity timeout exceeded. Killing the current test with SIGQUIT.\n");
+	}
 
 	if (disk_usage_limit_exceeded(settings, disk_usage))
 		return "Disk usage limit exceeded.\n";