Message ID | 20201204195007.10215-1-janusz.krzysztofik@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [i-g-t,v2] runner: Don't kill a test on taint if watching timeouts | expand |
On Fri, Dec 04, 2020 at 08:50:07PM +0100, Janusz Krzysztofik wrote: > We may still be interested in results of a test even if it has tainted > the kernel. On the other hand, we need to kill the test on taint if no > other means of killing it on a jam is active. > > If abort on both kernel taint or a timeout is requested, decrease all > potential timeouts significantly while the taint is detected instead of > aborting immediately. However, report the taint as the reason of the > abort if a timeout decreased by the taint expires. > > v2: Fix missing show_kernel_task_state() lost on rebase conflict > resolution (Chris - thanks!) > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> The effects of this is that we sometimes now get more logs from a test at the cost of it not directly showing up as an incomplete. We would still get the igt@runner@aborted result for it so overall we still catch tainting cases. Chris's comments have been clarified off-list not to mean directly opposing this patch, so Reviewed-by: Petri Latvala <petri.latvala@intel.com> > --- > runner/executor.c | 26 ++++++++++++++++++++------ > 1 file changed, 20 insertions(+), 6 deletions(-) > > diff --git a/runner/executor.c b/runner/executor.c > index 1688ae41d..faf272d85 100644 > --- a/runner/executor.c > +++ b/runner/executor.c > @@ -726,6 +726,8 @@ static const char *need_to_timeout(struct settings *settings, > double time_since_kill, > size_t disk_usage) > { > + int decrease = 1; > + > if (killed) { > /* > * Timeout after being killed is a hardcoded amount > @@ -753,20 +755,32 @@ static const char *need_to_timeout(struct settings *settings, > } > > /* > - * If we're configured to care about taints, kill the > - * test if there's a taint. > + * If we're configured to care about taints, > + * decrease timeouts in use if there's a taint, > + * or kill the test if no timeouts have been requested. > */ > if (settings->abort_mask & ABORT_TAINT && > - is_tainted(taints)) > - return "Killing the test because the kernel is tainted.\n"; > + is_tainted(taints)) { > + /* list of timeouts that may postpone immediate kill on taint */ > + if (settings->per_test_timeout || settings->inactivity_timeout) > + decrease = 10; > + else > + return "Killing the test because the kernel is tainted.\n"; > + } > > if (settings->per_test_timeout != 0 && > - time_since_subtest > settings->per_test_timeout) > + time_since_subtest > settings->per_test_timeout / decrease) { > + if (decrease > 1) > + return "Killing the test because the kernel is tainted.\n"; > return show_kernel_task_state("Per-test timeout exceeded. Killing the current test with SIGQUIT.\n"); > + } > > if (settings->inactivity_timeout != 0 && > - time_since_activity > settings->inactivity_timeout) > + time_since_activity > settings->inactivity_timeout / decrease ) { > + if (decrease > 1) > + return "Killing the test because the kernel is tainted.\n"; > return show_kernel_task_state("Inactivity timeout exceeded. Killing the current test with SIGQUIT.\n"); > + } > > if (disk_usage_limit_exceeded(settings, disk_usage)) > return "Disk usage limit exceeded.\n"; > -- > 2.21.1 >
On Mon, 2020-12-07 at 15:09 +0200, Petri Latvala wrote: > On Fri, Dec 04, 2020 at 08:50:07PM +0100, Janusz Krzysztofik wrote: > > We may still be interested in results of a test even if it has tainted > > the kernel. On the other hand, we need to kill the test on taint if no > > other means of killing it on a jam is active. > > > > If abort on both kernel taint or a timeout is requested, decrease all > > potential timeouts significantly while the taint is detected instead of > > aborting immediately. However, report the taint as the reason of the > > abort if a timeout decreased by the taint expires. > > > > v2: Fix missing show_kernel_task_state() lost on rebase conflict > > resolution (Chris - thanks!) > > > > Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> > > The effects of this is that we sometimes now get more logs from a test > at the cost of it not directly showing up as an incomplete. We would > still get the igt@runner@aborted result for it so overall we still > catch tainting cases. > > Chris's comments have been clarified off-list not to mean directly > opposing this patch, so > > > Reviewed-by: Petri Latvala <petri.latvala@intel.com> Thanks, pushed. Janusz > > > > > --- > > runner/executor.c | 26 ++++++++++++++++++++------ > > 1 file changed, 20 insertions(+), 6 deletions(-) > > > > diff --git a/runner/executor.c b/runner/executor.c > > index 1688ae41d..faf272d85 100644 > > --- a/runner/executor.c > > +++ b/runner/executor.c > > @@ -726,6 +726,8 @@ static const char *need_to_timeout(struct settings *settings, > > double time_since_kill, > > size_t disk_usage) > > { > > + int decrease = 1; > > + > > if (killed) { > > /* > > * Timeout after being killed is a hardcoded amount > > @@ -753,20 +755,32 @@ static const char *need_to_timeout(struct settings *settings, > > } > > > > /* > > - * If we're configured to care about taints, kill the > > - * test if there's a taint. > > + * If we're configured to care about taints, > > + * decrease timeouts in use if there's a taint, > > + * or kill the test if no timeouts have been requested. > > */ > > if (settings->abort_mask & ABORT_TAINT && > > - is_tainted(taints)) > > - return "Killing the test because the kernel is tainted.\n"; > > + is_tainted(taints)) { > > + /* list of timeouts that may postpone immediate kill on taint */ > > + if (settings->per_test_timeout || settings->inactivity_timeout) > > + decrease = 10; > > + else > > + return "Killing the test because the kernel is tainted.\n"; > > + } > > > > if (settings->per_test_timeout != 0 && > > - time_since_subtest > settings->per_test_timeout) > > + time_since_subtest > settings->per_test_timeout / decrease) { > > + if (decrease > 1) > > + return "Killing the test because the kernel is tainted.\n"; > > return show_kernel_task_state("Per-test timeout exceeded. Killing the current test with SIGQUIT.\n"); > > + } > > > > if (settings->inactivity_timeout != 0 && > > - time_since_activity > settings->inactivity_timeout) > > + time_since_activity > settings->inactivity_timeout / decrease ) { > > + if (decrease > 1) > > + return "Killing the test because the kernel is tainted.\n"; > > return show_kernel_task_state("Inactivity timeout exceeded. Killing the current test with SIGQUIT.\n"); > > + } > > > > if (disk_usage_limit_exceeded(settings, disk_usage)) > > return "Disk usage limit exceeded.\n"; > > -- > > 2.21.1 > >
Quoting Janusz Krzysztofik (2020-12-04 19:50:07) > We may still be interested in results of a test even if it has tainted > the kernel. On the other hand, we need to kill the test on taint if no > other means of killing it on a jam is active. > > If abort on both kernel taint or a timeout is requested, decrease all > potential timeouts significantly while the taint is detected instead of > aborting immediately. However, report the taint as the reason of the > abort if a timeout decreased by the taint expires. This has the nasty side effect of not stopping the test run after a kernel taint. Instead the next test inherits the tainted condition from the previous test and usually ends up being declared incomplete. False positives are frustrating. -Chris
On Wed, Jan 06, 2021 at 09:41:37AM +0000, Chris Wilson wrote: > Quoting Janusz Krzysztofik (2020-12-04 19:50:07) > > We may still be interested in results of a test even if it has tainted > > the kernel. On the other hand, we need to kill the test on taint if no > > other means of killing it on a jam is active. > > > > If abort on both kernel taint or a timeout is requested, decrease all > > potential timeouts significantly while the taint is detected instead of > > aborting immediately. However, report the taint as the reason of the > > abort if a timeout decreased by the taint expires. > > This has the nasty side effect of not stopping the test run after a > kernel taint. Instead the next test inherits the tainted condition from > the previous test and usually ends up being declared incomplete. > > False positives are frustrating. > -Chris Do you have a link to a test run where this happened? This patch didn't change the between-tests abort checks.
Quoting Petri Latvala (2021-01-07 09:40:02) > On Wed, Jan 06, 2021 at 09:41:37AM +0000, Chris Wilson wrote: > > Quoting Janusz Krzysztofik (2020-12-04 19:50:07) > > > We may still be interested in results of a test even if it has tainted > > > the kernel. On the other hand, we need to kill the test on taint if no > > > other means of killing it on a jam is active. > > > > > > If abort on both kernel taint or a timeout is requested, decrease all > > > potential timeouts significantly while the taint is detected instead of > > > aborting immediately. However, report the taint as the reason of the > > > abort if a timeout decreased by the taint expires. > > > > This has the nasty side effect of not stopping the test run after a > > kernel taint. Instead the next test inherits the tainted condition from > > the previous test and usually ends up being declared incomplete. > > > > False positives are frustrating. > > -Chris > > > Do you have a link to a test run where this happened? This patch > didn't change the between-tests abort checks. The taint is from the warnings in the penultimate test: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9550/shard-skl7/igt_runner20.txt -Chris
On Thu, Jan 07, 2021 at 09:49:23AM +0000, Chris Wilson wrote: > Quoting Petri Latvala (2021-01-07 09:40:02) > > On Wed, Jan 06, 2021 at 09:41:37AM +0000, Chris Wilson wrote: > > > Quoting Janusz Krzysztofik (2020-12-04 19:50:07) > > > > We may still be interested in results of a test even if it has tainted > > > > the kernel. On the other hand, we need to kill the test on taint if no > > > > other means of killing it on a jam is active. > > > > > > > > If abort on both kernel taint or a timeout is requested, decrease all > > > > potential timeouts significantly while the taint is detected instead of > > > > aborting immediately. However, report the taint as the reason of the > > > > abort if a timeout decreased by the taint expires. > > > > > > This has the nasty side effect of not stopping the test run after a > > > kernel taint. Instead the next test inherits the tainted condition from > > > the previous test and usually ends up being declared incomplete. > > > > > > False positives are frustrating. > > > -Chris > > > > > > Do you have a link to a test run where this happened? This patch > > didn't change the between-tests abort checks. > > The taint is from the warnings in the penultimate test: > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9550/shard-skl7/igt_runner20.txt Ah, I see. https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_9550/shard-skl7/dmesg20.txt This is the tainting WARN I presume: <4>[ 917.575173] Memory manager not clean during takedown. <4>[ 917.575272] WARNING: CPU: 2 PID: 7 at drivers/gpu/drm/drm_mm.c:999 drm_mm_takedown+0x51/0x100 That happens after @gem, before @evict. In other words, this is all in the same exec() of i915_selftest --run-subtest live. Incorrect _dynamic_ subtest gets blamed. Getting the killing right here is a bit tricky, possibly doable. Or rather, getting the blame right is doable, killing will be inherently racy...
diff --git a/runner/executor.c b/runner/executor.c index 1688ae41d..faf272d85 100644 --- a/runner/executor.c +++ b/runner/executor.c @@ -726,6 +726,8 @@ static const char *need_to_timeout(struct settings *settings, double time_since_kill, size_t disk_usage) { + int decrease = 1; + if (killed) { /* * Timeout after being killed is a hardcoded amount @@ -753,20 +755,32 @@ static const char *need_to_timeout(struct settings *settings, } /* - * If we're configured to care about taints, kill the - * test if there's a taint. + * If we're configured to care about taints, + * decrease timeouts in use if there's a taint, + * or kill the test if no timeouts have been requested. */ if (settings->abort_mask & ABORT_TAINT && - is_tainted(taints)) - return "Killing the test because the kernel is tainted.\n"; + is_tainted(taints)) { + /* list of timeouts that may postpone immediate kill on taint */ + if (settings->per_test_timeout || settings->inactivity_timeout) + decrease = 10; + else + return "Killing the test because the kernel is tainted.\n"; + } if (settings->per_test_timeout != 0 && - time_since_subtest > settings->per_test_timeout) + time_since_subtest > settings->per_test_timeout / decrease) { + if (decrease > 1) + return "Killing the test because the kernel is tainted.\n"; return show_kernel_task_state("Per-test timeout exceeded. Killing the current test with SIGQUIT.\n"); + } if (settings->inactivity_timeout != 0 && - time_since_activity > settings->inactivity_timeout) + time_since_activity > settings->inactivity_timeout / decrease ) { + if (decrease > 1) + return "Killing the test because the kernel is tainted.\n"; return show_kernel_task_state("Inactivity timeout exceeded. Killing the current test with SIGQUIT.\n"); + } if (disk_usage_limit_exceeded(settings, disk_usage)) return "Disk usage limit exceeded.\n";
We may still be interested in results of a test even if it has tainted the kernel. On the other hand, we need to kill the test on taint if no other means of killing it on a jam is active. If abort on both kernel taint or a timeout is requested, decrease all potential timeouts significantly while the taint is detected instead of aborting immediately. However, report the taint as the reason of the abort if a timeout decreased by the taint expires. v2: Fix missing show_kernel_task_state() lost on rebase conflict resolution (Chris - thanks!) Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com> --- runner/executor.c | 26 ++++++++++++++++++++------ 1 file changed, 20 insertions(+), 6 deletions(-)