diff mbox series

[08/23] common: fix pkill by running test program in a separate session

Message ID 173706974197.1927324.9208284704325894988.stgit@frogsfrogsfrogs (mailing list archive)
State New
Headers show
Series [01/23] generic/476: fix fsstress process management | expand

Commit Message

Darrick J. Wong Jan. 16, 2025, 11:27 p.m. UTC
From: Darrick J. Wong <djwong@kernel.org>

Run each test program with a separate session id so that we can tell
pkill to kill all processes of a given name, but only within our own
session id.  This /should/ suffice to run multiple fstests on the same
machine without one instance shooting down processes of another
instance.

This fixes a general problem with using "pkill --parent" -- if the
process being targeted is not a direct descendant of the bash script
calling pkill, then pkill will not do anything.  The scrub stress tests
make use of multiple background subshells, which is how a ^C in the
parent process fails to result in fsx/fsstress being killed.

This is necessary to fix SOAK_DURATION runtime constraints for all the
scrub stress tests.  However, there is a cost -- the test program no
longer runs with the same controlling tty as ./check, which means that
^Z doesn't work and SIGINT/SIGQUIT are set to SIG_IGN.  IOWs, if a test
wants to kill its subprocesses, it must use another signal such as
SIGPIPE.  Fortunately, bash doesn't whine about children dying due to
fatal signals if the children run in a different session id.

I also explored alternate designs, and this was the least unsatisfying:

a) Setting the process group didn't work because background subshells
are assigned a new group id.

b) Constraining the pkill/pgrep search to a cgroup could work, but we'd
have to set up a cgroup in which to run the fstest.

c) Putting test subprocesses in a systemd sub-scope and telling systemd
to kill the sub-scope could work because ./check can already use it to
ensure that all child processes of a test are killed.  However, this is
an *optional* feature, which means that we'd have to require systemd.

d) Constraining the pkill/pgrep search to a particular mount namespace
could work, but we already have tests that set up their own mount
namespaces, which means the constrained pgrep will not find all child
processes of a test.

e) Constraining to any other type of namespace (uts, pid, etc) might not
work because those namespaces might not be enabled.

f) Revert check-parallel and go back to one fstests instance per system.
Zorro already chose not to revert.

So.  Change _run_seq to create a the ./$seq process with a new session
id, update _su calls to use the same session as the parent test, update
all the pkill sites to use a wrapper so that we only target processes
created by *this* instance of fstests, and update SIGINT to SIGPIPE.

Cc: <fstests@vger.kernel.org> # v2024.12.08
Fixes: 8973af00ec212f ("fstests: cleanup fsstress process management")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 check             |   33 ++++++++++++++++++++++++++++-----
 common/fuzzy      |   17 ++++++++---------
 common/rc         |   12 ++++++++++--
 tests/generic/310 |    6 +++---
 tests/generic/561 |    2 +-
 5 files changed, 50 insertions(+), 20 deletions(-)

Comments

Dave Chinner Jan. 21, 2025, 3:28 a.m. UTC | #1
On Thu, Jan 16, 2025 at 03:27:15PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Run each test program with a separate session id so that we can tell
> pkill to kill all processes of a given name, but only within our own
> session id.  This /should/ suffice to run multiple fstests on the same
> machine without one instance shooting down processes of another
> instance.
> 
> This fixes a general problem with using "pkill --parent" -- if the
> process being targeted is not a direct descendant of the bash script
> calling pkill, then pkill will not do anything.  The scrub stress tests
> make use of multiple background subshells, which is how a ^C in the
> parent process fails to result in fsx/fsstress being killed.

Yeah, 'pkill --parent' was the best I had managed to come up that
mostly worked, not because it perfect. That was something I wanted
feedback on before merge because it still had problems...

> This is necessary to fix SOAK_DURATION runtime constraints for all the
> scrub stress tests.  However, there is a cost -- the test program no
> longer runs with the same controlling tty as ./check, which means that
> ^Z doesn't work and SIGINT/SIGQUIT are set to SIG_IGN.  IOWs, if a test
> wants to kill its subprocesses, it must use another signal such as
> SIGPIPE.  Fortunately, bash doesn't whine about children dying due to
> fatal signals if the children run in a different session id.
> 
> I also explored alternate designs, and this was the least unsatisfying:
> 
> a) Setting the process group didn't work because background subshells
> are assigned a new group id.

Yup, tried that.

> b) Constraining the pkill/pgrep search to a cgroup could work, but we'd
> have to set up a cgroup in which to run the fstest.

thought about that, too, and considered if systemd scopes could do
that, but ...

> 
> c) Putting test subprocesses in a systemd sub-scope and telling systemd
> to kill the sub-scope could work because ./check can already use it to
> ensure that all child processes of a test are killed.  However, this is
> an *optional* feature, which means that we'd have to require systemd.

... requiring systemd was somewhat of a show-stopper for testing
older distros.

> d) Constraining the pkill/pgrep search to a particular mount namespace
> could work, but we already have tests that set up their own mount
> namespaces, which means the constrained pgrep will not find all child
> processes of a test.

*nod*.

> e) Constraining to any other type of namespace (uts, pid, etc) might not
> work because those namespaces might not be enabled.

*nod*

I also tried modifying fsstress to catch and propagate signals and a
couple of other ways of managing processes in the stress code, but
all ended up having significantly worse downsides than using 'pkill
--parent'.

I was aware of session IDs, but I've never used them before and
hadn't gone down the rabbit hole of working out how to use them when
I posted the initial RFC patchset.

> f) Revert check-parallel and go back to one fstests instance per system.
> Zorro already chose not to revert.
> 
> So.  Change _run_seq to create a the ./$seq process with a new session
> id, update _su calls to use the same session as the parent test, update
> all the pkill sites to use a wrapper so that we only target processes
> created by *this* instance of fstests, and update SIGINT to SIGPIPE.

Yeah, that's definitely cleaner.

.....

> @@ -1173,13 +1173,11 @@ _scratch_xfs_stress_scrub_cleanup() {
>  	rm -f "$runningfile"
>  	echo "Cleaning up scrub stress run at $(date)" >> $seqres.full
>  
> -	# Send SIGINT so that bash won't print a 'Terminated' message that
> -	# distorts the golden output.
>  	echo "Killing stressor processes at $(date)" >> $seqres.full
> -	_kill_fsstress
> -	pkill -INT --parent $$ xfs_io >> $seqres.full 2>&1
> -	pkill -INT --parent $$ fsx >> $seqres.full 2>&1
> -	pkill -INT --parent $$ xfs_scrub >> $seqres.full 2>&1
> +	_pkill --echo -PIPE fsstress >> $seqres.full 2>&1
> +	_pkill --echo -PIPE xfs_io >> $seqres.full 2>&1
> +	_pkill --echo -PIPE fsx >> $seqres.full 2>&1
> +	_pkill --echo -PIPE xfs_scrub >> $seqres.full 2>&1

Removing _kill_fsstress is wrong - the fsstress process has already
been renamed, so open coding 'pkill fsstress' may not match. The
_kill_fsstress() code gets changed to do the right thing here:

> @@ -69,7 +75,7 @@ _kill_fsstress()
>  	if [ -n "$_FSSTRESS_PID" ]; then
>  		# use SIGPIPE to avoid "Killed" messages from bash
>  		echo "killing $_FSSTRESS_BIN" >> $seqres.full
> -		pkill -PIPE $_FSSTRESS_BIN >> $seqres.full 2>&1
> +		_pkill -PIPE $_FSSTRESS_BIN >> $seqres.full 2>&1
>  		_wait_for_fsstress
>  		return $?
>  	fi

Then in the next patch when the _FSSTRESS_BIN workaround goes away,
_kill_fsstress() is exactly what you open coded in
_scratch_xfs_stress_scrub_cleanup()....

i.e. common/fuzzy really shouldn't open code the fsstress process
management - it should use the wrapper like everything else does.

Everything else in the patch looks good.

-Dave.
Darrick J. Wong Jan. 22, 2025, 4:24 a.m. UTC | #2
On Tue, Jan 21, 2025 at 02:28:26PM +1100, Dave Chinner wrote:
> On Thu, Jan 16, 2025 at 03:27:15PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Run each test program with a separate session id so that we can tell
> > pkill to kill all processes of a given name, but only within our own
> > session id.  This /should/ suffice to run multiple fstests on the same
> > machine without one instance shooting down processes of another
> > instance.
> > 
> > This fixes a general problem with using "pkill --parent" -- if the
> > process being targeted is not a direct descendant of the bash script
> > calling pkill, then pkill will not do anything.  The scrub stress tests
> > make use of multiple background subshells, which is how a ^C in the
> > parent process fails to result in fsx/fsstress being killed.
> 
> Yeah, 'pkill --parent' was the best I had managed to come up that
> mostly worked, not because it perfect. That was something I wanted
> feedback on before merge because it still had problems...
> 
> > This is necessary to fix SOAK_DURATION runtime constraints for all the
> > scrub stress tests.  However, there is a cost -- the test program no
> > longer runs with the same controlling tty as ./check, which means that
> > ^Z doesn't work and SIGINT/SIGQUIT are set to SIG_IGN.  IOWs, if a test
> > wants to kill its subprocesses, it must use another signal such as
> > SIGPIPE.  Fortunately, bash doesn't whine about children dying due to
> > fatal signals if the children run in a different session id.
> > 
> > I also explored alternate designs, and this was the least unsatisfying:
> > 
> > a) Setting the process group didn't work because background subshells
> > are assigned a new group id.
> 
> Yup, tried that.
> 
> > b) Constraining the pkill/pgrep search to a cgroup could work, but we'd
> > have to set up a cgroup in which to run the fstest.
> 
> thought about that, too, and considered if systemd scopes could do
> that, but ...
> 
> > 
> > c) Putting test subprocesses in a systemd sub-scope and telling systemd
> > to kill the sub-scope could work because ./check can already use it to
> > ensure that all child processes of a test are killed.  However, this is
> > an *optional* feature, which means that we'd have to require systemd.
> 
> ... requiring systemd was somewhat of a show-stopper for testing
> older distros.

Isn't RHEL7 the oldest one at this point?  And it does systemd.  At this
point the only reason I didn't go full systemd is out of consideration
for Devuan, since they probably need QA.

> > d) Constraining the pkill/pgrep search to a particular mount namespace
> > could work, but we already have tests that set up their own mount
> > namespaces, which means the constrained pgrep will not find all child
> > processes of a test.
> 
> *nod*.
> 
> > e) Constraining to any other type of namespace (uts, pid, etc) might not
> > work because those namespaces might not be enabled.
> 
> *nod*
> 
> I also tried modifying fsstress to catch and propagate signals and a
> couple of other ways of managing processes in the stress code, but
> all ended up having significantly worse downsides than using 'pkill
> --parent'.

Yeah, and then you'd still have to figure out fsx and any other random
little utility that a test might run in a background.

> I was aware of session IDs, but I've never used them before and
> hadn't gone down the rabbit hole of working out how to use them when
> I posted the initial RFC patchset.

<nod> Session IDs kinda suck, but they suck the least for nearly minimal
extra effort.

> > f) Revert check-parallel and go back to one fstests instance per system.
> > Zorro already chose not to revert.
> > 
> > So.  Change _run_seq to create a the ./$seq process with a new session
> > id, update _su calls to use the same session as the parent test, update
> > all the pkill sites to use a wrapper so that we only target processes
> > created by *this* instance of fstests, and update SIGINT to SIGPIPE.
> 
> Yeah, that's definitely cleaner.
> 
> .....
> 
> > @@ -1173,13 +1173,11 @@ _scratch_xfs_stress_scrub_cleanup() {
> >  	rm -f "$runningfile"
> >  	echo "Cleaning up scrub stress run at $(date)" >> $seqres.full
> >  
> > -	# Send SIGINT so that bash won't print a 'Terminated' message that
> > -	# distorts the golden output.
> >  	echo "Killing stressor processes at $(date)" >> $seqres.full
> > -	_kill_fsstress
> > -	pkill -INT --parent $$ xfs_io >> $seqres.full 2>&1
> > -	pkill -INT --parent $$ fsx >> $seqres.full 2>&1
> > -	pkill -INT --parent $$ xfs_scrub >> $seqres.full 2>&1
> > +	_pkill --echo -PIPE fsstress >> $seqres.full 2>&1
> > +	_pkill --echo -PIPE xfs_io >> $seqres.full 2>&1
> > +	_pkill --echo -PIPE fsx >> $seqres.full 2>&1
> > +	_pkill --echo -PIPE xfs_scrub >> $seqres.full 2>&1
> 
> Removing _kill_fsstress is wrong - the fsstress process has already
> been renamed, so open coding 'pkill fsstress' may not match. The
> _kill_fsstress() code gets changed to do the right thing here:
> 
> > @@ -69,7 +75,7 @@ _kill_fsstress()
> >  	if [ -n "$_FSSTRESS_PID" ]; then
> >  		# use SIGPIPE to avoid "Killed" messages from bash
> >  		echo "killing $_FSSTRESS_BIN" >> $seqres.full
> > -		pkill -PIPE $_FSSTRESS_BIN >> $seqres.full 2>&1
> > +		_pkill -PIPE $_FSSTRESS_BIN >> $seqres.full 2>&1
> >  		_wait_for_fsstress
> >  		return $?
> >  	fi
> 
> Then in the next patch when the _FSSTRESS_BIN workaround goes away,
> _kill_fsstress() is exactly what you open coded in
> _scratch_xfs_stress_scrub_cleanup()....
> 
> i.e. common/fuzzy really shouldn't open code the fsstress process
> management - it should use the wrapper like everything else does.

Ok will change.  I suppose I did go fix up the setting (or not) of
_FSSTRESS_PID.

> Everything else in the patch looks good.

Cool!

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner Jan. 22, 2025, 6:08 a.m. UTC | #3
On Tue, Jan 21, 2025 at 08:24:00PM -0800, Darrick J. Wong wrote:
> On Tue, Jan 21, 2025 at 02:28:26PM +1100, Dave Chinner wrote:
> > On Thu, Jan 16, 2025 at 03:27:15PM -0800, Darrick J. Wong wrote:
> > > c) Putting test subprocesses in a systemd sub-scope and telling systemd
> > > to kill the sub-scope could work because ./check can already use it to
> > > ensure that all child processes of a test are killed.  However, this is
> > > an *optional* feature, which means that we'd have to require systemd.
> > 
> > ... requiring systemd was somewhat of a show-stopper for testing
> > older distros.
> 
> Isn't RHEL7 the oldest one at this point?  And it does systemd.  At this
> point the only reason I didn't go full systemd is out of consideration
> for Devuan, since they probably need QA.

I have no idea what is out there in distro land vs what fstests
"supports". All I know is that there are distros out there that
don't use systemd.

It feels like poor form to prevent generic filesystem QA
infrastructure from running on those distros by making an avoidable
choice to tie the infrastructure exclusively to systemd-based
functionality....

-Dave.
Darrick J. Wong Jan. 22, 2025, 7:05 a.m. UTC | #4
On Wed, Jan 22, 2025 at 05:08:17PM +1100, Dave Chinner wrote:
> On Tue, Jan 21, 2025 at 08:24:00PM -0800, Darrick J. Wong wrote:
> > On Tue, Jan 21, 2025 at 02:28:26PM +1100, Dave Chinner wrote:
> > > On Thu, Jan 16, 2025 at 03:27:15PM -0800, Darrick J. Wong wrote:
> > > > c) Putting test subprocesses in a systemd sub-scope and telling systemd
> > > > to kill the sub-scope could work because ./check can already use it to
> > > > ensure that all child processes of a test are killed.  However, this is
> > > > an *optional* feature, which means that we'd have to require systemd.
> > > 
> > > ... requiring systemd was somewhat of a show-stopper for testing
> > > older distros.
> > 
> > Isn't RHEL7 the oldest one at this point?  And it does systemd.  At this
> > point the only reason I didn't go full systemd is out of consideration
> > for Devuan, since they probably need QA.
> 
> I have no idea what is out there in distro land vs what fstests
> "supports". All I know is that there are distros out there that
> don't use systemd.
> 
> It feels like poor form to prevent generic filesystem QA
> infrastructure from running on those distros by making an avoidable
> choice to tie the infrastructure exclusively to systemd-based
> functionality....

Agreed, though at some point after these bugfixes are merged I'll see if
I can build on the existing "if you have systemd then ___ else here's
your shabby opencoded version" logic in fstests to isolate the ./checks
from each other a little better.  It'd be kinda nice if we could
actually just put them in something resembling a modernish container,
albeit with the same underlying fs.

<shrug> Anyone else interested in that?

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner Jan. 22, 2025, 9:42 a.m. UTC | #5
On Tue, Jan 21, 2025 at 11:05:20PM -0800, Darrick J. Wong wrote:
> On Wed, Jan 22, 2025 at 05:08:17PM +1100, Dave Chinner wrote:
> > On Tue, Jan 21, 2025 at 08:24:00PM -0800, Darrick J. Wong wrote:
> > > On Tue, Jan 21, 2025 at 02:28:26PM +1100, Dave Chinner wrote:
> > > > On Thu, Jan 16, 2025 at 03:27:15PM -0800, Darrick J. Wong wrote:
> > > > > c) Putting test subprocesses in a systemd sub-scope and telling systemd
> > > > > to kill the sub-scope could work because ./check can already use it to
> > > > > ensure that all child processes of a test are killed.  However, this is
> > > > > an *optional* feature, which means that we'd have to require systemd.
> > > > 
> > > > ... requiring systemd was somewhat of a show-stopper for testing
> > > > older distros.
> > > 
> > > Isn't RHEL7 the oldest one at this point?  And it does systemd.  At this
> > > point the only reason I didn't go full systemd is out of consideration
> > > for Devuan, since they probably need QA.
> > 
> > I have no idea what is out there in distro land vs what fstests
> > "supports". All I know is that there are distros out there that
> > don't use systemd.
> > 
> > It feels like poor form to prevent generic filesystem QA
> > infrastructure from running on those distros by making an avoidable
> > choice to tie the infrastructure exclusively to systemd-based
> > functionality....
> 
> Agreed, though at some point after these bugfixes are merged I'll see if
> I can build on the existing "if you have systemd then ___ else here's
> your shabby opencoded version" logic in fstests to isolate the ./checks
> from each other a little better.  It'd be kinda nice if we could
> actually just put them in something resembling a modernish container,
> albeit with the same underlying fs.

Agreed, but I don't think we need to depend on systemd for that,
either.

> <shrug> Anyone else interested in that?

check-parallel has already started down that road with the
mount namespace isolation it uses for the runner tasks via
src/nsexec.c.

My plan has been to factor more of the check test running code
(similar to what I did with the test list parsing) so that the
check-parallel can iterate sections itself and runners can execute
individual tests directly, rather than bouncing them through check
to execute a set of tests serially. Then check-parallel could do
whatever it needed to isolate individual tests from each other and
nothing in check would need to change.

Now I'm wondering if I can just run each runner's check instance
in it's own private PID namespace as easily as I'm running them in
their own private mount namespace...

Hmmm - looks like src/nsexec.c can create new PID namespaces via
the "-p" option. I haven't used that before - I wonder if that's a
better solution that using per-test session IDs to solve the pkill
--parent problem? Something to look into in the morning....

-Dave.
Darrick J. Wong Jan. 22, 2025, 9:46 p.m. UTC | #6
On Wed, Jan 22, 2025 at 08:42:49PM +1100, Dave Chinner wrote:
> On Tue, Jan 21, 2025 at 11:05:20PM -0800, Darrick J. Wong wrote:
> > On Wed, Jan 22, 2025 at 05:08:17PM +1100, Dave Chinner wrote:
> > > On Tue, Jan 21, 2025 at 08:24:00PM -0800, Darrick J. Wong wrote:
> > > > On Tue, Jan 21, 2025 at 02:28:26PM +1100, Dave Chinner wrote:
> > > > > On Thu, Jan 16, 2025 at 03:27:15PM -0800, Darrick J. Wong wrote:
> > > > > > c) Putting test subprocesses in a systemd sub-scope and telling systemd
> > > > > > to kill the sub-scope could work because ./check can already use it to
> > > > > > ensure that all child processes of a test are killed.  However, this is
> > > > > > an *optional* feature, which means that we'd have to require systemd.
> > > > > 
> > > > > ... requiring systemd was somewhat of a show-stopper for testing
> > > > > older distros.
> > > > 
> > > > Isn't RHEL7 the oldest one at this point?  And it does systemd.  At this
> > > > point the only reason I didn't go full systemd is out of consideration
> > > > for Devuan, since they probably need QA.
> > > 
> > > I have no idea what is out there in distro land vs what fstests
> > > "supports". All I know is that there are distros out there that
> > > don't use systemd.
> > > 
> > > It feels like poor form to prevent generic filesystem QA
> > > infrastructure from running on those distros by making an avoidable
> > > choice to tie the infrastructure exclusively to systemd-based
> > > functionality....
> > 
> > Agreed, though at some point after these bugfixes are merged I'll see if
> > I can build on the existing "if you have systemd then ___ else here's
> > your shabby opencoded version" logic in fstests to isolate the ./checks
> > from each other a little better.  It'd be kinda nice if we could
> > actually just put them in something resembling a modernish container,
> > albeit with the same underlying fs.
> 
> Agreed, but I don't think we need to depend on systemd for that,
> either.
> 
> > <shrug> Anyone else interested in that?
> 
> check-parallel has already started down that road with the
> mount namespace isolation it uses for the runner tasks via
> src/nsexec.c.
> 
> My plan has been to factor more of the check test running code
> (similar to what I did with the test list parsing) so that the
> check-parallel can iterate sections itself and runners can execute
> individual tests directly, rather than bouncing them through check
> to execute a set of tests serially. Then check-parallel could do
> whatever it needed to isolate individual tests from each other and
> nothing in check would need to change.
> 
> Now I'm wondering if I can just run each runner's check instance
> in it's own private PID namespace as easily as I'm running them in
> their own private mount namespace...
> 
> Hmmm - looks like src/nsexec.c can create new PID namespaces via
> the "-p" option. I haven't used that before - I wonder if that's a
> better solution that using per-test session IDs to solve the pkill
> --parent problem? Something to look into in the morning....

I tried that -- it appears to work, but then:

# ./src/nsexec -p bash
Current time: Wed Jan 22 13:43:33 PST 2025; Terminal: /dev/pts/0
# ps
fatal library error, lookup self
# 

Regular processes and pkill/pgrep seem to work, but that didn't seem
especially encouraging so I stopped. :/

--D

> -Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner Jan. 23, 2025, 1:16 a.m. UTC | #7
On Wed, Jan 22, 2025 at 01:46:09PM -0800, Darrick J. Wong wrote:
> On Wed, Jan 22, 2025 at 08:42:49PM +1100, Dave Chinner wrote:
> > On Tue, Jan 21, 2025 at 11:05:20PM -0800, Darrick J. Wong wrote:
> > > On Wed, Jan 22, 2025 at 05:08:17PM +1100, Dave Chinner wrote:
> > > > On Tue, Jan 21, 2025 at 08:24:00PM -0800, Darrick J. Wong wrote:
> > > > > On Tue, Jan 21, 2025 at 02:28:26PM +1100, Dave Chinner wrote:
> > > > > > On Thu, Jan 16, 2025 at 03:27:15PM -0800, Darrick J. Wong wrote:
> > > > > > > c) Putting test subprocesses in a systemd sub-scope and telling systemd
> > > > > > > to kill the sub-scope could work because ./check can already use it to
> > > > > > > ensure that all child processes of a test are killed.  However, this is
> > > > > > > an *optional* feature, which means that we'd have to require systemd.
> > > > > > 
> > > > > > ... requiring systemd was somewhat of a show-stopper for testing
> > > > > > older distros.
> > > > > 
> > > > > Isn't RHEL7 the oldest one at this point?  And it does systemd.  At this
> > > > > point the only reason I didn't go full systemd is out of consideration
> > > > > for Devuan, since they probably need QA.
> > > > 
> > > > I have no idea what is out there in distro land vs what fstests
> > > > "supports". All I know is that there are distros out there that
> > > > don't use systemd.
> > > > 
> > > > It feels like poor form to prevent generic filesystem QA
> > > > infrastructure from running on those distros by making an avoidable
> > > > choice to tie the infrastructure exclusively to systemd-based
> > > > functionality....
> > > 
> > > Agreed, though at some point after these bugfixes are merged I'll see if
> > > I can build on the existing "if you have systemd then ___ else here's
> > > your shabby opencoded version" logic in fstests to isolate the ./checks
> > > from each other a little better.  It'd be kinda nice if we could
> > > actually just put them in something resembling a modernish container,
> > > albeit with the same underlying fs.
> > 
> > Agreed, but I don't think we need to depend on systemd for that,
> > either.
> > 
> > > <shrug> Anyone else interested in that?
> > 
> > check-parallel has already started down that road with the
> > mount namespace isolation it uses for the runner tasks via
> > src/nsexec.c.
> > 
> > My plan has been to factor more of the check test running code
> > (similar to what I did with the test list parsing) so that the
> > check-parallel can iterate sections itself and runners can execute
> > individual tests directly, rather than bouncing them through check
> > to execute a set of tests serially. Then check-parallel could do
> > whatever it needed to isolate individual tests from each other and
> > nothing in check would need to change.
> > 
> > Now I'm wondering if I can just run each runner's check instance
> > in it's own private PID namespace as easily as I'm running them in
> > their own private mount namespace...
> > 
> > Hmmm - looks like src/nsexec.c can create new PID namespaces via
> > the "-p" option. I haven't used that before - I wonder if that's a
> > better solution that using per-test session IDs to solve the pkill
> > --parent problem? Something to look into in the morning....
> 
> I tried that -- it appears to work, but then:
> 
> # ./src/nsexec -p bash
> Current time: Wed Jan 22 13:43:33 PST 2025; Terminal: /dev/pts/0
> # ps
> fatal library error, lookup self
> # 

That looks like a bug in whatever distro you are using - it works
as it should here on a recent debian unstable userspace.

Note, however, that ps will show all processes in both the parent
and the child namespace the shell is running on because the contents
of /proc are the same for both.

However, because we are also using private mount namespaces for the
check process, pid_namespaces(7) tells us:

/proc and PID namespaces

       A /proc filesystem shows (in the /proc/pid directories) only
       processes visible in the PID namespace of the process that
       performed the mount, even if the /proc filesystem is viewed
       from processes in other namespaces.

       After creating a new PID namespace, it is useful for the
       child to change its root directory and mount a new procfs
       instance at /proc so that tools  such  as  ps(1) work
>>>    correctly.  If a new mount namespace is simultaneously
>>>    created by including CLONE_NEWNS in the flags argument of
>>>    clone(2) or unshare(2), then it isn't necessary to change the
>>>    root directory: a new procfs instance can be mounted directly
>>>    over /proc.

       From a shell, the command to mount /proc is:

           $ mount -t proc proc /proc

       Calling readlink(2) on the path /proc/self yields the process
       ID of the caller in the PID namespace of the procfs mount
       (i.e., the PID name‐ space of the process that mounted the
       procfs).  This can be useful for introspection purposes, when
       a process wants to discover its  PID  in other namespaces.

This appears to give us an environment that only shows the processes
within the current PID namespace:

$ sudo src/nsexec -p -m bash
# mount -t proc proc /proc
# ps waux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   7384  3744 pts/1    S    11:55   0:00 bash
root          72  0.0  0.0   8300  3736 pts/1    R+   12:04   0:00 ps waux
# pstree -N pid
[4026538173]
bash───pstree
#

Yup, there we go - we have full PID isolation for this shell.

OK, I suspect this is a better way to proceed for check-parallel than
to need session IDs for individual tests and wrappers for
pgrep/pkill/pidwait. Let me see what breaks when I use this.....

-Dave.
Dave Chinner Jan. 28, 2025, 4:34 a.m. UTC | #8
On Thu, Jan 23, 2025 at 12:16:50PM +1100, Dave Chinner wrote:
> On Wed, Jan 22, 2025 at 01:46:09PM -0800, Darrick J. Wong wrote:
> > > > Agreed, though at some point after these bugfixes are merged I'll see if
> > > > I can build on the existing "if you have systemd then ___ else here's
> > > > your shabby opencoded version" logic in fstests to isolate the ./checks
> > > > from each other a little better.  It'd be kinda nice if we could
> > > > actually just put them in something resembling a modernish container,
> > > > albeit with the same underlying fs.
> > > 
> > > Agreed, but I don't think we need to depend on systemd for that,
> > > either.
> > > 
> > > > <shrug> Anyone else interested in that?
> > > 
> > > check-parallel has already started down that road with the
> > > mount namespace isolation it uses for the runner tasks via
> > > src/nsexec.c.
.....
> > > Hmmm - looks like src/nsexec.c can create new PID namespaces via
> > > the "-p" option. I haven't used that before - I wonder if that's a
> > > better solution that using per-test session IDs to solve the pkill
> > > --parent problem? Something to look into in the morning....

.....

> Note, however, that ps will show all processes in both the parent
> and the child namespace the shell is running on because the contents
> of /proc are the same for both.
> 
> However, because we are also using private mount namespaces for the
> check process, pid_namespaces(7) tells us:
> 
> /proc and PID namespaces
> 
>        A /proc filesystem shows (in the /proc/pid directories) only
>        processes visible in the PID namespace of the process that
>        performed the mount, even if the /proc filesystem is viewed
>        from processes in other namespaces.
> 
>        After creating a new PID namespace, it is useful for the
>        child to change its root directory and mount a new procfs
>        instance at /proc so that tools  such  as  ps(1) work
> >>>    correctly.  If a new mount namespace is simultaneously
> >>>    created by including CLONE_NEWNS in the flags argument of
> >>>    clone(2) or unshare(2), then it isn't necessary to change the
> >>>    root directory: a new procfs instance can be mounted directly
> >>>    over /proc.
> 
>        From a shell, the command to mount /proc is:
> 
>            $ mount -t proc proc /proc
> 
>        Calling readlink(2) on the path /proc/self yields the process
>        ID of the caller in the PID namespace of the procfs mount
>        (i.e., the PID name‐ space of the process that mounted the
>        procfs).  This can be useful for introspection purposes, when
>        a process wants to discover its  PID  in other namespaces.
> 
> This appears to give us an environment that only shows the processes
> within the current PID namespace:
> 
> $ sudo src/nsexec -p -m bash
> # mount -t proc proc /proc
> # ps waux
> USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root           1  0.0  0.0   7384  3744 pts/1    S    11:55   0:00 bash
> root          72  0.0  0.0   8300  3736 pts/1    R+   12:04   0:00 ps waux
> # pstree -N pid
> [4026538173]
> bash───pstree
> #
> 
> Yup, there we go - we have full PID isolation for this shell.

Ok, it took me some time to get this to work reliably - the way was
full of landmines with little documentation to guide around them.

1. If multiple commands are needed to be run from nsexec, a helper
   script is needed (I called it check-helper).

2. You have to `mount --make-private /proc` before doing anything to
   make the mounts of /proc inside the pid namespace work correctly.
   If you don't do this, every other mount namespace will also see
   only the newest mount, and that means every runner but the last
   one to start with the wrong mount and PID namespace view in /proc.

3. if you get the system into the state described by 1), unmounting
   /proc in each runner then races to remove the top /proc mount.
   Many threads get -EBUSY failures from unmount, leaving many stale
   mounts of /proc behind.

4. the top level shell where check-parallel was invoked is left with
   the view where /proc has been unmounted from a PID/mount
   namespace that has gone away. Hence /proc now has no processes or
   mounts being reported and nothing works until you mount a new
   instance /proc in that namespace.

5. After mounting proc again there are lots of mounts of stale /proc
   mounts still reported. They cannot be unmounted as the mount
   namespaces that created them have gone away and unmounting /proc
   in the current shell simply removes the last mounted one and we
   goto 4).

4. /tmp is still shared across all runner instances so all the

   concurrent runners dump all their tmp files in /tmp. However, the
   runners no longer have unique PIDs (i.e. check runs as PID 3 in
   all runner instaces). This means using /tmp/tmp.$$ as the
   check/test temp file definition results is instant tmp file name
   collisions and random things in check and tests fail.  check and
   common/preamble have to be converted to use `mktemp` to provide
   unique tempfile name prefixes again.

5. Don't forget to revert the parent /proc mount back to shared
   after check has finished running (or was aborted).

I think with this (current prototype patch below), we can use PID
namespaces rather than process session IDs for check-parallel safe
process management.

Thoughts?

-Dave.
Darrick J. Wong Jan. 28, 2025, 7:23 a.m. UTC | #9
On Tue, Jan 28, 2025 at 03:34:50PM +1100, Dave Chinner wrote:
> On Thu, Jan 23, 2025 at 12:16:50PM +1100, Dave Chinner wrote:
> > On Wed, Jan 22, 2025 at 01:46:09PM -0800, Darrick J. Wong wrote:
> > > > > Agreed, though at some point after these bugfixes are merged I'll see if
> > > > > I can build on the existing "if you have systemd then ___ else here's
> > > > > your shabby opencoded version" logic in fstests to isolate the ./checks
> > > > > from each other a little better.  It'd be kinda nice if we could
> > > > > actually just put them in something resembling a modernish container,
> > > > > albeit with the same underlying fs.
> > > > 
> > > > Agreed, but I don't think we need to depend on systemd for that,
> > > > either.
> > > > 
> > > > > <shrug> Anyone else interested in that?
> > > > 
> > > > check-parallel has already started down that road with the
> > > > mount namespace isolation it uses for the runner tasks via
> > > > src/nsexec.c.
> .....
> > > > Hmmm - looks like src/nsexec.c can create new PID namespaces via
> > > > the "-p" option. I haven't used that before - I wonder if that's a
> > > > better solution that using per-test session IDs to solve the pkill
> > > > --parent problem? Something to look into in the morning....
> 
> .....
> 
> > Note, however, that ps will show all processes in both the parent
> > and the child namespace the shell is running on because the contents
> > of /proc are the same for both.
> > 
> > However, because we are also using private mount namespaces for the
> > check process, pid_namespaces(7) tells us:
> > 
> > /proc and PID namespaces
> > 
> >        A /proc filesystem shows (in the /proc/pid directories) only
> >        processes visible in the PID namespace of the process that
> >        performed the mount, even if the /proc filesystem is viewed
> >        from processes in other namespaces.
> > 
> >        After creating a new PID namespace, it is useful for the
> >        child to change its root directory and mount a new procfs
> >        instance at /proc so that tools  such  as  ps(1) work
> > >>>    correctly.  If a new mount namespace is simultaneously
> > >>>    created by including CLONE_NEWNS in the flags argument of
> > >>>    clone(2) or unshare(2), then it isn't necessary to change the
> > >>>    root directory: a new procfs instance can be mounted directly
> > >>>    over /proc.
> > 
> >        From a shell, the command to mount /proc is:
> > 
> >            $ mount -t proc proc /proc
> > 
> >        Calling readlink(2) on the path /proc/self yields the process
> >        ID of the caller in the PID namespace of the procfs mount
> >        (i.e., the PID name‐ space of the process that mounted the
> >        procfs).  This can be useful for introspection purposes, when
> >        a process wants to discover its  PID  in other namespaces.
> > 
> > This appears to give us an environment that only shows the processes
> > within the current PID namespace:
> > 
> > $ sudo src/nsexec -p -m bash
> > # mount -t proc proc /proc
> > # ps waux
> > USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> > root           1  0.0  0.0   7384  3744 pts/1    S    11:55   0:00 bash
> > root          72  0.0  0.0   8300  3736 pts/1    R+   12:04   0:00 ps waux
> > # pstree -N pid
> > [4026538173]
> > bash───pstree
> > #
> > 
> > Yup, there we go - we have full PID isolation for this shell.
> 
> Ok, it took me some time to get this to work reliably - the way was
> full of landmines with little documentation to guide around them.
> 
> 1. If multiple commands are needed to be run from nsexec, a helper
>    script is needed (I called it check-helper).
> 
> 2. You have to `mount --make-private /proc` before doing anything to
>    make the mounts of /proc inside the pid namespace work correctly.
>    If you don't do this, every other mount namespace will also see
>    only the newest mount, and that means every runner but the last
>    one to start with the wrong mount and PID namespace view in /proc.
> 
> 3. if you get the system into the state described by 1), unmounting
>    /proc in each runner then races to remove the top /proc mount.
>    Many threads get -EBUSY failures from unmount, leaving many stale
>    mounts of /proc behind.
> 
> 4. the top level shell where check-parallel was invoked is left with
>    the view where /proc has been unmounted from a PID/mount
>    namespace that has gone away. Hence /proc now has no processes or
>    mounts being reported and nothing works until you mount a new
>    instance /proc in that namespace.
> 
> 5. After mounting proc again there are lots of mounts of stale /proc
>    mounts still reported. They cannot be unmounted as the mount
>    namespaces that created them have gone away and unmounting /proc
>    in the current shell simply removes the last mounted one and we
>    goto 4).
> 
> 4. /tmp is still shared across all runner instances so all the
> 
>    concurrent runners dump all their tmp files in /tmp. However, the
>    runners no longer have unique PIDs (i.e. check runs as PID 3 in
>    all runner instaces). This means using /tmp/tmp.$$ as the
>    check/test temp file definition results is instant tmp file name
>    collisions and random things in check and tests fail.  check and
>    common/preamble have to be converted to use `mktemp` to provide
>    unique tempfile name prefixes again.
> 
> 5. Don't forget to revert the parent /proc mount back to shared
>    after check has finished running (or was aborted).
> 
> I think with this (current prototype patch below), we can use PID
> namespaces rather than process session IDs for check-parallel safe
> process management.
> 
> Thoughts?

Was about to go to bed, but can we also start a new mount namespace,
create a private (or at least non-global) /tmp to put files into, and
then each test instance is isolated from clobbering the /tmpfiles of
other ./check instances *and* the rest of the system?

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> 
> check-parallel: use PID namespaces for runner process isolation
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> This provides isolation between individual runners so that they
> cannot see the processes that other test runners have created.
> This means tools like pkill will only find processes run by the test
> that is calling it, hence there is no danger taht it might kill
> processes owned by a different test in a different runner context.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  check           |  6 +++++-
>  check-helper    | 23 +++++++++++++++++++++++
>  check-parallel  | 25 +++++++++++++++++++------
>  common/preamble |  5 ++++-
>  4 files changed, 51 insertions(+), 8 deletions(-)
> 
> diff --git a/check b/check
> index 4dc266dcf..314436667 100755
> --- a/check
> +++ b/check
> @@ -4,7 +4,11 @@
>  #
>  # Control script for QA
>  #
> -tmp=/tmp/$$
> +
> +# When run from a pid namespace, /tmp/tmp.$$ is not a unique identifier.
> +# Use mktemp instead.
> +tmp=`mktemp`
> +
>  status=0
>  needwrap=true
>  needsum=true
> diff --git a/check-helper b/check-helper
> new file mode 100755
> index 000000000..47a92de8b
> --- /dev/null
> +++ b/check-helper
> @@ -0,0 +1,23 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2025 Red Hat, Inc.  All Rights Reserved.
> +#
> +
> +# When check is run from check-parallel, it is run in private mount and PID
> +# namespaces. We need to set up /proc to only show processes in this PID
> +# namespace, so we mount a new instance of /proc over the top of the existing
> +# version. Because we are in a private mount namespace, the does not propagate
> +# outside this context and hence does not conflict with the other test runners
> +# that are performing the same setup actions.
> +
> +args="$@"
> +
> +#echo $args
> +mount -t proc proc /proc
> +ret=$?
> +if [ $ret -eq 0 ]; then
> +	./check $args
> +	umount -l /proc
> +else
> +	echo failed to mount /proc, ret $ret!
> +fi
> diff --git a/check-parallel b/check-parallel
> index 2fbf0fdbe..6082baf5e 100755
> --- a/check-parallel
> +++ b/check-parallel
> @@ -259,14 +259,14 @@ runner_go()
>  	rm -f $RESULT_BASE/check.*
>  
>  	# Only supports default mkfs parameters right now
> -	wipefs -a $TEST_DEV > /dev/null 2>&1
> -	yes | mkfs -t $FSTYP $TEST_MKFS_OPTS $TEST_DEV > /dev/null 2>&1
> +	wipefs -a $TEST_DEV > $me/log 2>&1
> +	yes | mkfs -t $FSTYP $TEST_MKFS_OPTS $TEST_DEV >> $me/log 2>&1
>  
>  #	export DUMP_CORRUPT_FS=1
>  
> -	# Run the tests in it's own mount namespace, as per the comment below
> -	# that precedes making the basedir a private mount.
> -	./src/nsexec -m ./check $run_section -x unreliable_in_parallel --exact-order ${runner_list[$id]} > $me/log 2>&1
> +	# Run the tests in it's own mount and PID namespace, as per the comment
> +	# below that precedes making the basedir a private mount.
> +	./src/nsexec -m -p ./check-helper $run_section -x unreliable_in_parallel --exact-order ${runner_list[$id]} >> $me/log 2>&1
>  
>  	wait
>  	sleep 1
> @@ -291,6 +291,8 @@ cleanup()
>  	umount -R $basedir/*/test 2> /dev/null
>  	umount -R $basedir/*/scratch 2> /dev/null
>  	losetup --detach-all
> +	mount --make-shared /proc
> +	mount --make-shared $basedir
>  }
>  
>  trap "cleanup; exit" HUP INT QUIT TERM
> @@ -311,10 +313,17 @@ fi
>  # controls the mount to succeed without actually unmounting the filesytsem
>  # because a mount namespace still holds a reference to it. This causes other
>  # operations on the block device to fail as it is still busy (e.g. fsck, mkfs,
> -# etc). Hence we make the basedir private here and then run each check instance
> +# etc).
> +#
> +# Hence we make the basedir private here and then run each check instance
>  # in it's own mount namespace so that they cannot see mounts that other tests
>  # are performing.
> +#
> +# We also need to make /proc private so that the runners can be run cleanly in
> +# a PID namespace. This requires an new mount of /proc inside the PID namespace,
> +# and this requires a private mount namespace to work correctly.
>  mount --make-private $basedir
> +mount --make-private /proc
>  
>  now=`date +%Y-%m-%d-%H:%M:%S`
>  for ((i = 0; i < $runners; i++)); do
> @@ -328,6 +337,10 @@ for ((i = 0; i < $runners; i++)); do
>  done;
>  wait
>  
> +# Restore the parents back to shared mount namespace behaviour.
> +mount --make-shared /proc
> +mount --make-shared $basedir
> +
>  if [ -n "$show_test_list" ]; then
>  	exit 0
>  fi
> diff --git a/common/preamble b/common/preamble
> index 78e45d522..8f47b172a 100644
> --- a/common/preamble
> +++ b/common/preamble
> @@ -43,8 +43,11 @@ _begin_fstest()
>  	seqres=$RESULT_DIR/$seq
>  	echo "QA output created by $seq"
>  
> +	# When run from a pid namespace, /tmp/tmp.$$ is not a unique identifier.
> +	# Use mktemp instead.
> +	tmp=`mktemp`
> +
>  	here=`pwd`
> -	tmp=/tmp/$$
>  	status=1	# failure is the default!
>  
>  	_register_cleanup _cleanup
Dave Chinner Jan. 28, 2025, 8:39 p.m. UTC | #10
On Mon, Jan 27, 2025 at 11:23:52PM -0800, Darrick J. Wong wrote:
> On Tue, Jan 28, 2025 at 03:34:50PM +1100, Dave Chinner wrote:
> > On Thu, Jan 23, 2025 at 12:16:50PM +1100, Dave Chinner wrote:
> > 4. /tmp is still shared across all runner instances so all the
> > 
> >    concurrent runners dump all their tmp files in /tmp. However, the
> >    runners no longer have unique PIDs (i.e. check runs as PID 3 in
> >    all runner instaces). This means using /tmp/tmp.$$ as the
> >    check/test temp file definition results is instant tmp file name
> >    collisions and random things in check and tests fail.  check and
> >    common/preamble have to be converted to use `mktemp` to provide
> >    unique tempfile name prefixes again.
> > 
> > 5. Don't forget to revert the parent /proc mount back to shared
> >    after check has finished running (or was aborted).
> > 
> > I think with this (current prototype patch below), we can use PID
> > namespaces rather than process session IDs for check-parallel safe
> > process management.
> > 
> > Thoughts?
> 
> Was about to go to bed, but can we also start a new mount namespace,
> create a private (or at least non-global) /tmp to put files into, and
> then each test instance is isolated from clobbering the /tmpfiles of
> other ./check instances *and* the rest of the system?

We probably can. I didn't want to go down that rat hole straight
away, because then I'd have to make a decision about what to mount
there. One thing at a time....

I suspect that I can just use a tmpfs filesystem for it - there's
heaps of memory available on my test machines and we don't use /tmp
to hold large files, so that should work fine for me.  However, I'm
a little concerned about what will happen when testing under memory
pressure situations if /tmp needs memory to operate correctly.

I'll have a look at what is needed for private tmpfs /tmp instances
to work - it should work just fine.

However, if check-parallel has taught me anything, it is that trying
to use "should work" features on a modern system tends to mean "this
is a poorly documented rat-hole that with many dead-ends that will
be explored before a working solution is found"...

-Dave.
Darrick J. Wong Jan. 29, 2025, 3:13 a.m. UTC | #11
On Wed, Jan 29, 2025 at 07:39:22AM +1100, Dave Chinner wrote:
> On Mon, Jan 27, 2025 at 11:23:52PM -0800, Darrick J. Wong wrote:
> > On Tue, Jan 28, 2025 at 03:34:50PM +1100, Dave Chinner wrote:
> > > On Thu, Jan 23, 2025 at 12:16:50PM +1100, Dave Chinner wrote:
> > > 4. /tmp is still shared across all runner instances so all the
> > > 
> > >    concurrent runners dump all their tmp files in /tmp. However, the
> > >    runners no longer have unique PIDs (i.e. check runs as PID 3 in
> > >    all runner instaces). This means using /tmp/tmp.$$ as the
> > >    check/test temp file definition results is instant tmp file name
> > >    collisions and random things in check and tests fail.  check and
> > >    common/preamble have to be converted to use `mktemp` to provide
> > >    unique tempfile name prefixes again.
> > > 
> > > 5. Don't forget to revert the parent /proc mount back to shared
> > >    after check has finished running (or was aborted).
> > > 
> > > I think with this (current prototype patch below), we can use PID
> > > namespaces rather than process session IDs for check-parallel safe
> > > process management.
> > > 
> > > Thoughts?
> > 
> > Was about to go to bed, but can we also start a new mount namespace,
> > create a private (or at least non-global) /tmp to put files into, and
> > then each test instance is isolated from clobbering the /tmpfiles of
> > other ./check instances *and* the rest of the system?
> 
> We probably can. I didn't want to go down that rat hole straight
> away, because then I'd have to make a decision about what to mount
> there. One thing at a time....
> 
> I suspect that I can just use a tmpfs filesystem for it - there's
> heaps of memory available on my test machines and we don't use /tmp
> to hold large files, so that should work fine for me.  However, I'm
> a little concerned about what will happen when testing under memory
> pressure situations if /tmp needs memory to operate correctly.
> 
> I'll have a look at what is needed for private tmpfs /tmp instances
> to work - it should work just fine.
> 
> However, if check-parallel has taught me anything, it is that trying
> to use "should work" features on a modern system tends to mean "this
> is a poorly documented rat-hole that with many dead-ends that will
> be explored before a working solution is found"...

<nod> I'm running an experiment overnight with the following patch to
get rid of the session id grossness.  AFAICT it can also be used by
check-parallel to start its subprocesses in separate pid namespaces,
though I didn't investigate thoroughly.

I'm also not sure it's required for check-helper to unmount the /proc
that it creates; merely exiting seems to clean everything up? <shrug>

I also tried using systemd-nspawn to run fstests inside a "container"
but very quickly discovered that you can't pass block devices to the
container which makes fstests pretty useless for testing real scsi
devices. :/

--D

check: run tests in a private pid/mount namespace

Experiment with running tests in a private pid/mount namespace for
better isolation of the scheduler (check) vs the test (./$seq).  This
also makes it cleaner to adjust the oom score and is a convenient place
to set up the environment before invoking the test.

Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
 check        |   30 +++++-------------------------
 check-helper |   31 +++++++++++++++++++++++++++++++
 common/rc    |    8 +++-----
 3 files changed, 39 insertions(+), 30 deletions(-)
 create mode 100755 check-helper

diff --git a/check b/check
index 00ee5af2a31e34..9c70f6f1e10110 100755
--- a/check
+++ b/check
@@ -698,41 +698,21 @@ _adjust_oom_score -500
 # systemd doesn't automatically remove transient scopes that fail to terminate
 # when systemd tells them to terminate (e.g. programs stuck in D state when
 # systemd sends SIGKILL), so we use reset-failed to tear down the scope.
-#
-# Use setsid to run the test program with a separate session id so that we
-# can pkill only the processes started by this test.
 _run_seq() {
-	local cmd=(bash -c "test -w ${OOM_SCORE_ADJ} && echo 250 > ${OOM_SCORE_ADJ}; exec setsid bash ./$seq")
+	local cmd=("./check-helper" "./$seq")
 
 	if [ -n "${HAVE_SYSTEMD_SCOPES}" ]; then
 		local unit="$(systemd-escape "fs$seq").scope"
 		systemctl reset-failed "${unit}" &> /dev/null
-		systemd-run --quiet --unit "${unit}" --scope "${cmd[@]}" &
-		CHILDPID=$!
-		wait
+		systemd-run --quiet --unit "${unit}" --scope "${cmd[@]}"
 		res=$?
-		unset CHILDPID
 		systemctl stop "${unit}" &> /dev/null
 		return "${res}"
 	else
 		# bash won't run the SIGINT trap handler while there are
 		# foreground children in a separate session, so we must run
 		# the test in the background and wait for it.
-		"${cmd[@]}" &
-		CHILDPID=$!
-		wait
-		unset CHILDPID
-	fi
-}
-
-_kill_seq() {
-	if [ -n "$CHILDPID" ]; then
-		# SIGPIPE will kill all the children (including fsstress)
-		# without bash logging fatal signal termination messages to the
-		# console
-		pkill -PIPE --session "$CHILDPID"
-		wait
-		unset CHILDPID
+		"${cmd[@]}"
 	fi
 }
 
@@ -741,9 +721,9 @@ _prepare_test_list
 fstests_start_time="$(date +"%F %T")"
 
 if $OPTIONS_HAVE_SECTIONS; then
-	trap "_kill_seq; _summary; exit \$status" 0 1 2 3 15
+	trap "_summary; exit \$status" 0 1 2 3 15
 else
-	trap "_kill_seq; _wrapup; exit \$status" 0 1 2 3 15
+	trap "_wrapup; exit \$status" 0 1 2 3 15
 fi
 
 function run_section()
diff --git a/check-helper b/check-helper
new file mode 100755
index 00000000000000..2cc8dbe5cfc791
--- /dev/null
+++ b/check-helper
@@ -0,0 +1,31 @@
+#!/bin/bash
+
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2025 Oracle.  All Rights Reserved.
+#
+# Try starting things in a private pid/mount namespace with a private /tmp
+# and /proc so that child process trees cannot interfere with each other.
+
+if [ -n "${IN_NSEXEC}" ]; then
+	for path in /proc /tmp; do
+		mount --make-private "$path"
+	done
+	unset IN_NSEXEC
+	mount -t proc proc /proc
+	mount -t tmpfs tmpfs /tmp
+
+	oom_knob="/proc/self/oom_score_adj"
+	test -w "${oom_knob}" && echo 250 > "${oom_knob}"
+
+	# Run the test, but don't let it be pid 1 because that will confuse
+	# the filter functions in common/dump.
+	"$@"
+	exit
+fi
+
+if [ -z "$1" ] || [ "$1" = "--help" ]; then
+	echo "Usage: $0 command [args...]"
+	exit 1
+fi
+
+IN_NSEXEC=1 exec "$(dirname "$0")/src/nsexec" -m -p $0 "$@"
diff --git a/common/rc b/common/rc
index 1c28a2d190f5a0..cc080ecaa9f801 100644
--- a/common/rc
+++ b/common/rc
@@ -33,13 +33,13 @@ _test_sync()
 # Kill only the test processes started by this test
 _pkill()
 {
-	pkill --session 0 "$@"
+	pkill "$@"
 }
 
 # Find only the test processes started by this test
 _pgrep()
 {
-	pgrep --session 0 "$@"
+	pgrep "$@"
 }
 
 # Common execution handling for fsstress invocation.
@@ -2817,11 +2817,9 @@ _require_user_exists()
 	[ "$?" == "0" ] || _notrun "$user user not defined."
 }
 
-# Run all non-root processes in the same session as the root.  Believe it or
-# not, passing $SHELL in this manner works both for "su" and "su -c cmd".
 _su()
 {
-	su --session-command $SHELL "$@"
+	su "$@"
 }
 
 # check if a user exists and is able to execute commands.
Dave Chinner Jan. 29, 2025, 6:06 a.m. UTC | #12
On Tue, Jan 28, 2025 at 07:13:13PM -0800, Darrick J. Wong wrote:
> On Wed, Jan 29, 2025 at 07:39:22AM +1100, Dave Chinner wrote:
> > On Mon, Jan 27, 2025 at 11:23:52PM -0800, Darrick J. Wong wrote:
> > > On Tue, Jan 28, 2025 at 03:34:50PM +1100, Dave Chinner wrote:
> > > > On Thu, Jan 23, 2025 at 12:16:50PM +1100, Dave Chinner wrote:
> > > > 4. /tmp is still shared across all runner instances so all the
> > > > 
> > > >    concurrent runners dump all their tmp files in /tmp. However, the
> > > >    runners no longer have unique PIDs (i.e. check runs as PID 3 in
> > > >    all runner instaces). This means using /tmp/tmp.$$ as the
> > > >    check/test temp file definition results is instant tmp file name
> > > >    collisions and random things in check and tests fail.  check and
> > > >    common/preamble have to be converted to use `mktemp` to provide
> > > >    unique tempfile name prefixes again.
> > > > 
> > > > 5. Don't forget to revert the parent /proc mount back to shared
> > > >    after check has finished running (or was aborted).
> > > > 
> > > > I think with this (current prototype patch below), we can use PID
> > > > namespaces rather than process session IDs for check-parallel safe
> > > > process management.
> > > > 
> > > > Thoughts?
> > > 
> > > Was about to go to bed, but can we also start a new mount namespace,
> > > create a private (or at least non-global) /tmp to put files into, and
> > > then each test instance is isolated from clobbering the /tmpfiles of
> > > other ./check instances *and* the rest of the system?
> > 
> > We probably can. I didn't want to go down that rat hole straight
> > away, because then I'd have to make a decision about what to mount
> > there. One thing at a time....
> > 
> > I suspect that I can just use a tmpfs filesystem for it - there's
> > heaps of memory available on my test machines and we don't use /tmp
> > to hold large files, so that should work fine for me.  However, I'm
> > a little concerned about what will happen when testing under memory
> > pressure situations if /tmp needs memory to operate correctly.
> > 
> > I'll have a look at what is needed for private tmpfs /tmp instances
> > to work - it should work just fine.
> > 
> > However, if check-parallel has taught me anything, it is that trying
> > to use "should work" features on a modern system tends to mean "this
> > is a poorly documented rat-hole that with many dead-ends that will
> > be explored before a working solution is found"...
> 
> <nod> I'm running an experiment overnight with the following patch to
> get rid of the session id grossness.  AFAICT it can also be used by
> check-parallel to start its subprocesses in separate pid namespaces,
> though I didn't investigate thoroughly.

I don't think check-parallel needs to start each check instance in
it's own PID namespace - it's the tests themselves that need the
isolation from each other.

However, the check instances require a private mount namespace, as
they mount and unmount test/scratch devices themselves and we do not
want other check instances seeing those mounts.

Hence I think the current check-parallel code doing mount namespace
isolation as it already does should work with this patch enabling
per-test process isolation inside check itself.

> I'm also not sure it's required for check-helper to unmount the /proc
> that it creates; merely exiting seems to clean everything up? <shrug>

Yeah, I think tearing down the mount namespace (i.e. exiting the
process that nsexec created) drops the last active reference to the
mounts inside the private namespace and so it gets torn down that
way.

So from my perspective, I think your check-helper namespace patch is
a good improvement and I'll build/fix anything I come across on top
of it. Once your series of fixes goes in, I'll rebase all the stuff
I've got on top it and go from there...

> I also tried using systemd-nspawn to run fstests inside a "container"
> but very quickly discovered that you can't pass block devices to the
> container which makes fstests pretty useless for testing real scsi
> devices. :/

Yet another dead-end in the poorly sign-posted rat-hole, eh?

-Dave.
Darrick J. Wong Jan. 29, 2025, 7:33 a.m. UTC | #13
On Wed, Jan 29, 2025 at 05:06:52PM +1100, Dave Chinner wrote:
> On Tue, Jan 28, 2025 at 07:13:13PM -0800, Darrick J. Wong wrote:
> > On Wed, Jan 29, 2025 at 07:39:22AM +1100, Dave Chinner wrote:
> > > On Mon, Jan 27, 2025 at 11:23:52PM -0800, Darrick J. Wong wrote:
> > > > On Tue, Jan 28, 2025 at 03:34:50PM +1100, Dave Chinner wrote:
> > > > > On Thu, Jan 23, 2025 at 12:16:50PM +1100, Dave Chinner wrote:
> > > > > 4. /tmp is still shared across all runner instances so all the
> > > > > 
> > > > >    concurrent runners dump all their tmp files in /tmp. However, the
> > > > >    runners no longer have unique PIDs (i.e. check runs as PID 3 in
> > > > >    all runner instaces). This means using /tmp/tmp.$$ as the
> > > > >    check/test temp file definition results is instant tmp file name
> > > > >    collisions and random things in check and tests fail.  check and
> > > > >    common/preamble have to be converted to use `mktemp` to provide
> > > > >    unique tempfile name prefixes again.
> > > > > 
> > > > > 5. Don't forget to revert the parent /proc mount back to shared
> > > > >    after check has finished running (or was aborted).
> > > > > 
> > > > > I think with this (current prototype patch below), we can use PID
> > > > > namespaces rather than process session IDs for check-parallel safe
> > > > > process management.
> > > > > 
> > > > > Thoughts?
> > > > 
> > > > Was about to go to bed, but can we also start a new mount namespace,
> > > > create a private (or at least non-global) /tmp to put files into, and
> > > > then each test instance is isolated from clobbering the /tmpfiles of
> > > > other ./check instances *and* the rest of the system?
> > > 
> > > We probably can. I didn't want to go down that rat hole straight
> > > away, because then I'd have to make a decision about what to mount
> > > there. One thing at a time....
> > > 
> > > I suspect that I can just use a tmpfs filesystem for it - there's
> > > heaps of memory available on my test machines and we don't use /tmp
> > > to hold large files, so that should work fine for me.  However, I'm
> > > a little concerned about what will happen when testing under memory
> > > pressure situations if /tmp needs memory to operate correctly.
> > > 
> > > I'll have a look at what is needed for private tmpfs /tmp instances
> > > to work - it should work just fine.
> > > 
> > > However, if check-parallel has taught me anything, it is that trying
> > > to use "should work" features on a modern system tends to mean "this
> > > is a poorly documented rat-hole that with many dead-ends that will
> > > be explored before a working solution is found"...
> > 
> > <nod> I'm running an experiment overnight with the following patch to
> > get rid of the session id grossness.  AFAICT it can also be used by
> > check-parallel to start its subprocesses in separate pid namespaces,
> > though I didn't investigate thoroughly.
> 
> I don't think check-parallel needs to start each check instance in
> it's own PID namespace - it's the tests themselves that need the
> isolation from each other.
> 
> However, the check instances require a private mount namespace, as
> they mount and unmount test/scratch devices themselves and we do not
> want other check instances seeing those mounts.
> 
> Hence I think the current check-parallel code doing mount namespace
> isolation as it already does should work with this patch enabling
> per-test process isolation inside check itself.
> 
> > I'm also not sure it's required for check-helper to unmount the /proc
> > that it creates; merely exiting seems to clean everything up? <shrug>
> 
> Yeah, I think tearing down the mount namespace (i.e. exiting the
> process that nsexec created) drops the last active reference to the
> mounts inside the private namespace and so it gets torn down that
> way.
> 
> So from my perspective, I think your check-helper namespace patch is
> a good improvement and I'll build/fix anything I come across on top
> of it. Once your series of fixes goes in, I'll rebase all the stuff
> I've got on top it and go from there...

<nod> I might reformulate the pkill code to use nsexec (and not systemd)
if it's available; systemd scopes if those are available (I figured out
how to get systemd to tell me the cgroup name); or worst case fall back
to process sessions if neither are available.

I don't know how ancient of a userspace we realistically have to support
since (afaict) namespaces and systemd both showed up around the 2.6.24
era?  But I also don't know if Devuan at least does pid/mount
namespaces.

--D

> > I also tried using systemd-nspawn to run fstests inside a "container"
> > but very quickly discovered that you can't pass block devices to the
> > container which makes fstests pretty useless for testing real scsi
> > devices. :/
> 
> Yet another dead-end in the poorly sign-posted rat-hole, eh?

Yup.

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
diff mbox series

Patch

diff --git a/check b/check
index 607d2456e6a1fe..bafe48bf12db67 100755
--- a/check
+++ b/check
@@ -698,18 +698,41 @@  _adjust_oom_score -500
 # systemd doesn't automatically remove transient scopes that fail to terminate
 # when systemd tells them to terminate (e.g. programs stuck in D state when
 # systemd sends SIGKILL), so we use reset-failed to tear down the scope.
+#
+# Use setsid to run the test program with a separate session id so that we
+# can pkill only the processes started by this test.
 _run_seq() {
-	local cmd=(bash -c "test -w ${OOM_SCORE_ADJ} && echo 250 > ${OOM_SCORE_ADJ}; exec ./$seq")
+	local cmd=(bash -c "test -w ${OOM_SCORE_ADJ} && echo 250 > ${OOM_SCORE_ADJ}; exec setsid bash ./$seq")
 
 	if [ -n "${HAVE_SYSTEMD_SCOPES}" ]; then
 		local unit="$(systemd-escape "fs$seq").scope"
 		systemctl reset-failed "${unit}" &> /dev/null
-		systemd-run --quiet --unit "${unit}" --scope "${cmd[@]}"
+		systemd-run --quiet --unit "${unit}" --scope "${cmd[@]}" &
+		CHILDPID=$!
+		wait
 		res=$?
+		unset CHILDPID
 		systemctl stop "${unit}" &> /dev/null
 		return "${res}"
 	else
-		"${cmd[@]}"
+		# bash won't run the SIGINT trap handler while there are
+		# foreground children in a separate session, so we must run
+		# the test in the background and wait for it.
+		"${cmd[@]}" &
+		CHILDPID=$!
+		wait
+		unset CHILDPID
+	fi
+}
+
+_kill_seq() {
+	if [ -n "$CHILDPID" ]; then
+		# SIGPIPE will kill all the children (including fsstress)
+		# without bash logging fatal signal termination messages to the
+		# console
+		pkill -PIPE --session "$CHILDPID"
+		wait
+		unset CHILDPID
 	fi
 }
 
@@ -718,9 +741,9 @@  _prepare_test_list
 fstests_start_time="$(date +"%F %T")"
 
 if $OPTIONS_HAVE_SECTIONS; then
-	trap "_summary; exit \$status" 0 1 2 3 15
+	trap "_kill_seq; _summary; exit \$status" 0 1 2 3 15
 else
-	trap "_wrapup; exit \$status" 0 1 2 3 15
+	trap "_kill_seq; _wrapup; exit \$status" 0 1 2 3 15
 fi
 
 function run_section()
diff --git a/common/fuzzy b/common/fuzzy
index 0a2d91542b561e..772ce7ddcff6d8 100644
--- a/common/fuzzy
+++ b/common/fuzzy
@@ -891,7 +891,7 @@  __stress_xfs_scrub_loop() {
 	local runningfile="$2"
 	local scrub_startat="$3"
 	shift; shift; shift
-	local sigint_ret="$(( $(kill -l SIGINT) + 128 ))"
+	local signal_ret="$(( $(kill -l SIGPIPE) + 128 ))"
 	local scrublog="$tmp.scrub"
 
 	while __stress_scrub_running "$scrub_startat" "$runningfile"; do
@@ -901,8 +901,8 @@  __stress_xfs_scrub_loop() {
 	while __stress_scrub_running "$end" "$runningfile"; do
 		_scratch_scrub "$@" &> $scrublog
 		res=$?
-		if [ "$res" -eq "$sigint_ret" ]; then
-			# Ignore SIGINT because the cleanup function sends
+		if [ "$res" -eq "$signal_ret" ]; then
+			# Ignore SIGPIPE because the cleanup function sends
 			# that to terminate xfs_scrub
 			res=0
 		fi
@@ -1173,13 +1173,11 @@  _scratch_xfs_stress_scrub_cleanup() {
 	rm -f "$runningfile"
 	echo "Cleaning up scrub stress run at $(date)" >> $seqres.full
 
-	# Send SIGINT so that bash won't print a 'Terminated' message that
-	# distorts the golden output.
 	echo "Killing stressor processes at $(date)" >> $seqres.full
-	_kill_fsstress
-	pkill -INT --parent $$ xfs_io >> $seqres.full 2>&1
-	pkill -INT --parent $$ fsx >> $seqres.full 2>&1
-	pkill -INT --parent $$ xfs_scrub >> $seqres.full 2>&1
+	_pkill --echo -PIPE fsstress >> $seqres.full 2>&1
+	_pkill --echo -PIPE xfs_io >> $seqres.full 2>&1
+	_pkill --echo -PIPE fsx >> $seqres.full 2>&1
+	_pkill --echo -PIPE xfs_scrub >> $seqres.full 2>&1
 
 	# Tests are not allowed to exit with the scratch fs frozen.  If we
 	# started a fs freeze/thaw background loop, wait for that loop to exit
@@ -1209,6 +1207,7 @@  _scratch_xfs_stress_scrub_cleanup() {
 	# Wait for the remaining children to exit.
 	echo "Waiting for children to exit at $(date)" >> $seqres.full
 	wait
+	echo "Children exited as of $(date)" >> $seqres.full
 
 	# Ensure the scratch fs is also writable before we exit.
 	if [ -n "$__SCRUB_STRESS_REMOUNT_LOOP" ]; then
diff --git a/common/rc b/common/rc
index 459be11c11c405..d143ba36265c6c 100644
--- a/common/rc
+++ b/common/rc
@@ -30,6 +30,12 @@  _test_sync()
 	_sync_fs $TEST_DIR
 }
 
+# Kill only the test processes started by this test
+_pkill()
+{
+	pkill --session 0 "$@"
+}
+
 # Common execution handling for fsstress invocation.
 #
 # We need per-test fsstress binaries because of the way fsstress forks and
@@ -69,7 +75,7 @@  _kill_fsstress()
 	if [ -n "$_FSSTRESS_PID" ]; then
 		# use SIGPIPE to avoid "Killed" messages from bash
 		echo "killing $_FSSTRESS_BIN" >> $seqres.full
-		pkill -PIPE $_FSSTRESS_BIN >> $seqres.full 2>&1
+		_pkill -PIPE $_FSSTRESS_BIN >> $seqres.full 2>&1
 		_wait_for_fsstress
 		return $?
 	fi
@@ -2740,9 +2746,11 @@  _require_user_exists()
 	[ "$?" == "0" ] || _notrun "$user user not defined."
 }
 
+# Run all non-root processes in the same session as the root.  Believe it or
+# not, passing $SHELL in this manner works both for "su" and "su -c cmd".
 _su()
 {
-	su "$@"
+	su --session-command $SHELL "$@"
 }
 
 # check if a user exists and is able to execute commands.
diff --git a/tests/generic/310 b/tests/generic/310
index 52babfdc803a21..570cc5f3859548 100755
--- a/tests/generic/310
+++ b/tests/generic/310
@@ -29,7 +29,7 @@  _begin_fstest auto
 # Override the default cleanup function.
 _cleanup()
 {
-	pkill -9 $seq.t_readdir > /dev/null 2>&1
+	_pkill -9 $seq.t_readdir > /dev/null 2>&1
 	wait
 	rm -rf $TEST_DIR/tmp
 	rm -f $tmp.*
@@ -83,7 +83,7 @@  _test_read()
 {
 	 $TEST_DIR/$seq.t_readdir_1 $SEQ_DIR > /dev/null 2>&1 &
 	sleep $RUN_TIME
-	pkill -PIPE $seq.t_readdir_1
+	_pkill -PIPE $seq.t_readdir_1
 	wait
 
 	check_kernel_bug
@@ -97,7 +97,7 @@  _test_lseek()
 	$TEST_DIR/$seq.t_readdir_2 $SEQ_DIR > /dev/null 2>&1 &
 	readdir_pid=$!
 	sleep $RUN_TIME
-	pkill -PIPE $seq.t_readdir_2
+	_pkill -PIPE $seq.t_readdir_2
 	wait
 
 	check_kernel_bug
diff --git a/tests/generic/561 b/tests/generic/561
index afe727ac56cbd9..b260aaf16c9256 100755
--- a/tests/generic/561
+++ b/tests/generic/561
@@ -40,7 +40,7 @@  function end_test()
 	# stop duperemove running
 	if [ -e $dupe_run ]; then
 		rm -f $dupe_run
-		pkill $dedup_bin >/dev/null 2>&1
+		_pkill $dedup_bin >/dev/null 2>&1
 		wait $dedup_pids
 		rm -f $dedup_prog
 	fi