[i-g-t,v2] i915/pm_rps: install SIGTERM handler for load_helper process

Message ID 20191120062912.10853-1-chuansheng.liu@intel.com (mailing list archive)
State New, archived

Commit Message

Chuansheng Liu Nov. 20, 2019, 6:29 a.m. UTC
Reference:
https://bugs.freedesktop.org/show_bug.cgi?id=112126

The issue we hit is that the GPU stays under very high load after
running the subtest min-max-config-loaded.

Some background on the issue:
RPS is not yet fully enabled on TGL, so running the subtest
min-max-config-loaded hits the assertion below:
==
(i915_pm_rps:1261) CRITICAL: Test assertion failure function loaded_check, file ../tests/i915/i915_pm_rps.c:505:
(i915_pm_rps:1261) CRITICAL: Failed assertion: freqs[MAX] <= freqs[CUR]
(i915_pm_rps:1261) CRITICAL: Last errno: 2, No such file or directory
==

With the igt stress test, we find that the GT sometimes stays busy
after running this subtest, because igt_spin_end() is not always
called.

Root cause analysis:
When the main process i915_pm_rps hits the assertion while running
the subtest min-max-config-loaded, it sends SIGTERM to the child
process load_helper, which it created to generate the GT load, and
then the main process itself exits.

The SIGTERM disposition in load_helper is the default one, which
terminates load_helper immediately. That is unsafe: igt_spin_end()
must always be called before the load_helper process exits, because
that call is what stops the GT load.

Furthermore, in the normal scenario, the main process sends SIGUSR1
to the child process before exiting, which stops the GT load in a
safe way.

So install the same handler for SIGTERM as well. Without this patch,
the GT may stay busy after running this subtest. Enabling RPS should
be tracked separately.

Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
---
 tests/i915/i915_pm_rps.c | 1 +
 1 file changed, 1 insertion(+)
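
The failure mode described in the commit message can be reproduced outside igt. The following is a minimal sketch under stated assumptions, not the igt code: run_helper() and helper_signal_handler() are hypothetical stand-ins for load_helper_run() and load_helper_signal_handler(), the nanosleep() loop stands in for the GPU load, and a comment marks the point where the real helper would call igt_spin_end(). A POSIX system is assumed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Set by the handler, polled by the helper loop. */
static volatile sig_atomic_t stop_requested;

static void helper_signal_handler(int sig)
{
	(void)sig;
	stop_requested = 1;
}

/*
 * Fork a helper, send it SIGTERM and report how it ended:
 *   0        the helper left its loop and reached the cleanup point
 *   SIGTERM  the helper was killed by the signal, cleanup was skipped
 *   -1       setup failed or the helper ended some other way
 */
static int run_helper(bool handle_sigterm)
{
	const struct timespec tick = { 0, 1000000 }; /* 1 ms */
	int ready[2], status;
	char byte = 0;
	pid_t pid;

	if (pipe(ready) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(ready[0]);
		close(ready[1]);
		return -1;
	}

	if (pid == 0) {
		close(ready[0]);
		signal(SIGUSR1, helper_signal_handler);
		signal(SIGTERM, handle_sigterm ? helper_signal_handler : SIG_DFL);

		/* Tell the parent that the handlers are in place. */
		if (write(ready[1], &byte, 1) != 1)
			_exit(2);

		while (!stop_requested)
			nanosleep(&tick, NULL); /* stands in for GPU load */

		/* The real helper would call igt_spin_end() here. */
		_exit(0);
	}

	close(ready[1]);
	if (read(ready[0], &byte, 1) != 1) {
		close(ready[0]);
		kill(pid, SIGKILL);
		waitpid(pid, NULL, 0);
		return -1;
	}
	close(ready[0]);

	kill(pid, SIGTERM);
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	if (WIFSIGNALED(status))
		return WTERMSIG(status);
	return WIFEXITED(status) && WEXITSTATUS(status) == 0 ? 0 : -1;
}
```

In this sketch, run_helper(true) returns 0 because the helper leaves its loop and reaches the cleanup point, while run_helper(false) returns SIGTERM because the default disposition kills the helper before any cleanup runs.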

Comments

Chris Wilson Nov. 20, 2019, 12:30 p.m. UTC | #1
Quoting Chuansheng Liu (2019-11-20 06:29:12)
> diff --git a/tests/i915/i915_pm_rps.c b/tests/i915/i915_pm_rps.c
> index ef627c0b..8c71c1a1 100644
> --- a/tests/i915/i915_pm_rps.c
> +++ b/tests/i915/i915_pm_rps.c
> @@ -252,6 +252,7 @@ static void load_helper_run(enum load load)
>  
>                 signal(SIGUSR1, load_helper_signal_handler);
>                 signal(SIGUSR2, load_helper_signal_handler);
> +               signal(SIGTERM, load_helper_signal_handler);

I don't see any behaviour changes to igt to cause it to send SIGTERM on
exit_subtest.

But you might as well just s/SIGUSR2/SIGTERM/ for clearer and common
intentions.
-Chris
Chuansheng Liu Nov. 21, 2019, 1:34 a.m. UTC | #2
Thanks for reviewing the patch; please see my comments below.

> > So here we install the proper handler for signal SIGTERM in the
> > similar way. Without this patch, the GT may keep busy after
> > running this subtest. Enabling rps should be tracked on the
> > other side.
> >
> > Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
> > ---
> >  tests/i915/i915_pm_rps.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/tests/i915/i915_pm_rps.c b/tests/i915/i915_pm_rps.c
> > index ef627c0b..8c71c1a1 100644
> > --- a/tests/i915/i915_pm_rps.c
> > +++ b/tests/i915/i915_pm_rps.c
> > @@ -252,6 +252,7 @@ static void load_helper_run(enum load load)
> >
> >                 signal(SIGUSR1, load_helper_signal_handler);
> >                 signal(SIGUSR2, load_helper_signal_handler);
> > +               signal(SIGTERM, load_helper_signal_handler);
> 
> I don't see any behaviour changes to igt to cause it to send SIGTERM on
> exit_subtest.

Yes, exit_subtest() does not send SIGTERM. But when the main process calls
igt_exit() to exit, it hits the assertion below and then enters fatal_sig_handler() with SIGABRT.
(i915_pm_rps:1680) igt_core-CRITICAL: Exiting with status code 98
i915_pm_rps: ../lib/igt_core.c:1775: igt_exit: Assertion `waitpid(-1, &tmp, WNOHANG) == -1 && errno == ECHILD' failed.
Received signal SIGABRT.

In fatal_sig_handler(), the installed exit handler fork_helper_exit_handler()
sends SIGTERM to all child processes.
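
The assertion in the log above checks that no child process is left when the main process exits. The sketch below reproduces just that check on a POSIX system. The names no_children_left(), spawn_idle_child() and stop_and_reap() are hypothetical and exist only for illustration; only the waitpid() expression is taken from the log.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Pid of the demo child, or -1 when there is none. */
static pid_t idle_child = -1;

/*
 * Same expression as in the failed assertion: true only when the caller
 * has no children at all. A child that is still running makes waitpid()
 * return 0; an exited but unreaped child makes it return that child's pid.
 */
static bool no_children_left(void)
{
	int tmp;

	return waitpid(-1, &tmp, WNOHANG) == -1 && errno == ECHILD;
}

/* Fork a child that only waits for signals, like a leaked helper. */
static int spawn_idle_child(void)
{
	idle_child = fork();
	if (idle_child == 0) {
		for (;;)
			pause();
	}
	return idle_child > 0 ? 0 : -1;
}

/* Terminate the demo child and collect its exit status. */
static int stop_and_reap(void)
{
	int status;

	if (idle_child <= 0)
		return -1;
	if (kill(idle_child, SIGKILL) < 0)
		return -1;
	if (waitpid(idle_child, &status, 0) != idle_child)
		return -1;
	idle_child = -1;
	return 0;
}
```

While the idle child is alive, no_children_left() is false, which corresponds to the state in which the assertion fires. It becomes true again only after the child has been both terminated and reaped.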

> 
> But you might as well just s/SIGUSR2/SIGTERM/ for clearer and common
> intentions.
I don't quite get your point: SIGUSR1 is for actively stopping load_helper, SIGUSR2 is for
switching between high and low load, and SIGTERM is for exiting passively.
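
The three roles listed above can be sketched as one handler that maps three signals onto two commands, toggle and stop. This is an illustration only: demo_state, demo_signal_handler() and demo_install_handlers() are hypothetical names, not the code in i915_pm_rps.c, and sigaction() is used here instead of signal() so that the handler stays installed across repeated deliveries.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

enum demo_load { DEMO_LOW = 0, DEMO_HIGH = 1 };

static struct {
	volatile sig_atomic_t load; /* DEMO_LOW or DEMO_HIGH */
	volatile sig_atomic_t stop; /* nonzero once the helper should exit */
} demo_state;

static void demo_signal_handler(int sig)
{
	if (sig == SIGUSR2)
		demo_state.load = demo_state.load == DEMO_LOW ? DEMO_HIGH : DEMO_LOW;
	else
		demo_state.stop = 1; /* SIGUSR1 (active stop) or SIGTERM (passive exit) */
}

/* Install the same handler for all three signals; returns 0 on success. */
static int demo_install_handlers(void)
{
	static const int sigs[] = { SIGUSR1, SIGUSR2, SIGTERM };
	struct sigaction sa;
	size_t i;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = demo_signal_handler;
	sigemptyset(&sa.sa_mask);

	for (i = 0; i < sizeof(sigs) / sizeof(sigs[0]); i++)
		if (sigaction(sigs[i], &sa, NULL) < 0)
			return -1;
	return 0;
}
```

Because SIGUSR1 and SIGTERM take the same branch, the helper's main loop needs only the one stop flag to reach its cleanup path.
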
Chris Wilson Nov. 21, 2019, 7:47 a.m. UTC | #3
Quoting Liu, Chuansheng (2019-11-21 01:34:24)
> Thanks for reviewing the patch, please see below comments.
> 
> > > So here we install the proper handler for signal SIGTERM in the
> > > similar way. Without this patch, the GT may keep busy after
> > > running this subtest. Enabling rps should be tracked on the
> > > other side.
> > >
> > > Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
> > > ---
> > >  tests/i915/i915_pm_rps.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/tests/i915/i915_pm_rps.c b/tests/i915/i915_pm_rps.c
> > > index ef627c0b..8c71c1a1 100644
> > > --- a/tests/i915/i915_pm_rps.c
> > > +++ b/tests/i915/i915_pm_rps.c
> > > @@ -252,6 +252,7 @@ static void load_helper_run(enum load load)
> > >
> > >                 signal(SIGUSR1, load_helper_signal_handler);
> > >                 signal(SIGUSR2, load_helper_signal_handler);
> > > +               signal(SIGTERM, load_helper_signal_handler);
> > 
> > I don't see any behaviour changes to igt to cause it to send SIGTERM on
> > exit_subtest.
> 
> Yes, exit_subtest() will not send SIGTERM out. But when main process calls
> igt_exit() to exit, it hits the below assertion, then goes to fatal_sig_handler() with SIGABRT.
> (i915_pm_rps:1680) igt_core-CRITICAL: Exiting with status code 98
> i915_pm_rps: ../lib/igt_core.c:1775: igt_exit: Assertion `waitpid(-1, &tmp, WNOHANG) == -1 && errno == ECHILD' failed.
> Received signal SIGABRT.

Ok, but that's not a huge concern, since we are already in an error state.
My concern is for fixing whatever got us into that state.

> In fatal_sig_handler(), the installed exit handler fork_helper_exit_handler()
> will send out the SIGTERM to all children process.
> 
> > 
> > But you might as well just s/SIGUSR2/SIGTERM/ for clearer and common
> > intentions.
> Don't get your real point, SIGUSR1 is for actively stopping load_helper, SIGUSR2 is for
> switching high and low load, the SIGTERM is for passively exiting.

I think the design of having a persistent helper process that leaks
between subtests is broken. Then using three signals for essentially only
2 commands is aesthetically unpleasing.
-Chris
Chuansheng Liu Nov. 21, 2019, 8:19 a.m. UTC | #4
> -----Original Message-----
> From: Chris Wilson <chris@chris-wilson.co.uk>
> Sent: Thursday, November 21, 2019 3:47 PM
> To: Liu, Chuansheng <chuansheng.liu@intel.com>;
> igt-dev@lists.freedesktop.org
> Cc: intel-gfx@lists.freedesktop.org
> Subject: RE: [Intel-gfx] [PATCH i-g-t v2] i915/pm_rps: install SIGTERM handler
> for load_helper process
> 
> Quoting Liu, Chuansheng (2019-11-21 01:34:24)
> > Thanks for reviewing the patch, please see below comments.
> >
> > > > So here we install the proper handler for signal SIGTERM in the
> > > > similar way. Without this patch, the GT may keep busy after
> > > > running this subtest. Enabling rps should be tracked on the
> > > > other side.
> > > >
> > > > Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
> > > > ---
> > > >  tests/i915/i915_pm_rps.c | 1 +
> > > >  1 file changed, 1 insertion(+)
> > > >
> > > > diff --git a/tests/i915/i915_pm_rps.c b/tests/i915/i915_pm_rps.c
> > > > index ef627c0b..8c71c1a1 100644
> > > > --- a/tests/i915/i915_pm_rps.c
> > > > +++ b/tests/i915/i915_pm_rps.c
> > > > @@ -252,6 +252,7 @@ static void load_helper_run(enum load load)
> > > >
> > > >                 signal(SIGUSR1, load_helper_signal_handler);
> > > >                 signal(SIGUSR2, load_helper_signal_handler);
> > > > +               signal(SIGTERM, load_helper_signal_handler);
> > >
> > > I don't see any behaviour changes to igt to cause it to send SIGTERM on
> > > exit_subtest.
> >
> > Yes, exit_subtest() will not send SIGTERM out. But when main process calls
> > igt_exit() to exit, it hits the below assertion, then goes to fatal_sig_handler() with SIGABRT.
> > (i915_pm_rps:1680) igt_core-CRITICAL: Exiting with status code 98
> > i915_pm_rps: ../lib/igt_core.c:1775: igt_exit: Assertion `waitpid(-1, &tmp, WNOHANG) == -1 && errno == ECHILD' failed.
> > Received signal SIGABRT.
> 
> Ok, but that's not a huge concern, since we are already in an error state.
> My concern is for fixing whatever got us into that state.
Agreed. In this case, we need to enable RPS completely. In the meantime, I would like this quick
patch to unblock the test cases that follow.

Without this quick fix, people can be misled when trying to find the real root cause :)
Would you mind merging this patch first?

> 
> > In fatal_sig_handler(), the installed exit handler fork_helper_exit_handler()
> > will send out the SIGTERM to all children process.
> >
> > >
> > > But you might as well just s/SIGUSR2/SIGTERM/ for clearer and common
> > > intentions.
> > > Don't get your real point, SIGUSR1 is for actively stopping load_helper, SIGUSR2 is for
> > > switching high and low load, the SIGTERM is for passively exiting.
> 
> I think the design of having a persistent helper process that leaks
> between subtests is broken. Then using three signals for essentially only
> 2 commands is aesthetically unpleasing.
Yes, to be honest, the main process should not receive SIGABRT according
to the original intent of the code, since all child processes should already be cleaned
up, whether load_helper or any other child process that was created.
Patch

diff --git a/tests/i915/i915_pm_rps.c b/tests/i915/i915_pm_rps.c
index ef627c0b..8c71c1a1 100644
--- a/tests/i915/i915_pm_rps.c
+++ b/tests/i915/i915_pm_rps.c
@@ -252,6 +252,7 @@  static void load_helper_run(enum load load)
 
 		signal(SIGUSR1, load_helper_signal_handler);
 		signal(SIGUSR2, load_helper_signal_handler);
+		signal(SIGTERM, load_helper_signal_handler);
 
 		igt_debug("Applying %s load...\n", lh.load ? "high" : "low");