Message ID | 173706974197.1927324.9208284704325894988.stgit@frogsfrogsfrogs (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [01/23] generic/476: fix fsstress process management | expand |
On Thu, Jan 16, 2025 at 03:27:15PM -0800, Darrick J. Wong wrote: > From: Darrick J. Wong <djwong@kernel.org> > > Run each test program with a separate session id so that we can tell > pkill to kill all processes of a given name, but only within our own > session id. This /should/ suffice to run multiple fstests on the same > machine without one instance shooting down processes of another > instance. > > This fixes a general problem with using "pkill --parent" -- if the > process being targeted is not a direct descendant of the bash script > calling pkill, then pkill will not do anything. The scrub stress tests > make use of multiple background subshells, which is how a ^C in the > parent process fails to result in fsx/fsstress being killed. Yeah, 'pkill --parent' was the best I had managed to come up that mostly worked, not because it perfect. That was something I wanted feedback on before merge because it still had problems... > This is necessary to fix SOAK_DURATION runtime constraints for all the > scrub stress tests. However, there is a cost -- the test program no > longer runs with the same controlling tty as ./check, which means that > ^Z doesn't work and SIGINT/SIGQUIT are set to SIG_IGN. IOWs, if a test > wants to kill its subprocesses, it must use another signal such as > SIGPIPE. Fortunately, bash doesn't whine about children dying due to > fatal signals if the children run in a different session id. > > I also explored alternate designs, and this was the least unsatisfying: > > a) Setting the process group didn't work because background subshells > are assigned a new group id. Yup, tried that. > b) Constraining the pkill/pgrep search to a cgroup could work, but we'd > have to set up a cgroup in which to run the fstest. thought about that, too, and considered if systemd scopes could do that, but ... > > c) Putting test subprocesses in a systemd sub-scope and telling systemd > to kill the sub-scope could work because ./check can already use it to > ensure that all child processes of a test are killed. However, this is > an *optional* feature, which means that we'd have to require systemd. ... requiring systemd was somewhat of a show-stopper for testing older distros. > d) Constraining the pkill/pgrep search to a particular mount namespace > could work, but we already have tests that set up their own mount > namespaces, which means the constrained pgrep will not find all child > processes of a test. *nod*. > e) Constraining to any other type of namespace (uts, pid, etc) might not > work because those namespaces might not be enabled. *nod* I also tried modifying fsstress to catch and propagate signals and a couple of other ways of managing processes in the stress code, but all ended up having significantly worse downsides than using 'pkill --parent'. I was aware of session IDs, but I've never used them before and hadn't gone down the rabbit hole of working out how to use them when I posted the initial RFC patchset. > f) Revert check-parallel and go back to one fstests instance per system. > Zorro already chose not to revert. > > So. Change _run_seq to create a the ./$seq process with a new session > id, update _su calls to use the same session as the parent test, update > all the pkill sites to use a wrapper so that we only target processes > created by *this* instance of fstests, and update SIGINT to SIGPIPE. Yeah, that's definitely cleaner. ..... > @@ -1173,13 +1173,11 @@ _scratch_xfs_stress_scrub_cleanup() { > rm -f "$runningfile" > echo "Cleaning up scrub stress run at $(date)" >> $seqres.full > > - # Send SIGINT so that bash won't print a 'Terminated' message that > - # distorts the golden output. > echo "Killing stressor processes at $(date)" >> $seqres.full > - _kill_fsstress > - pkill -INT --parent $$ xfs_io >> $seqres.full 2>&1 > - pkill -INT --parent $$ fsx >> $seqres.full 2>&1 > - pkill -INT --parent $$ xfs_scrub >> $seqres.full 2>&1 > + _pkill --echo -PIPE fsstress >> $seqres.full 2>&1 > + _pkill --echo -PIPE xfs_io >> $seqres.full 2>&1 > + _pkill --echo -PIPE fsx >> $seqres.full 2>&1 > + _pkill --echo -PIPE xfs_scrub >> $seqres.full 2>&1 Removing _kill_fsstress is wrong - the fsstress process has already been renamed, so open coding 'pkill fsstress' may not match. The _kill_fsstress() code gets changed to do the right thing here: > @@ -69,7 +75,7 @@ _kill_fsstress() > if [ -n "$_FSSTRESS_PID" ]; then > # use SIGPIPE to avoid "Killed" messages from bash > echo "killing $_FSSTRESS_BIN" >> $seqres.full > - pkill -PIPE $_FSSTRESS_BIN >> $seqres.full 2>&1 > + _pkill -PIPE $_FSSTRESS_BIN >> $seqres.full 2>&1 > _wait_for_fsstress > return $? > fi Then in the next patch when the _FSSTRESS_BIN workaround goes away, _kill_fsstress() is exactly what you open coded in _scratch_xfs_stress_scrub_cleanup().... i.e. common/fuzzy really shouldn't open code the fsstress process management - it should use the wrapper like everything else does. Everything else in the patch looks good. -Dave.
On Tue, Jan 21, 2025 at 02:28:26PM +1100, Dave Chinner wrote: > On Thu, Jan 16, 2025 at 03:27:15PM -0800, Darrick J. Wong wrote: > > From: Darrick J. Wong <djwong@kernel.org> > > > > Run each test program with a separate session id so that we can tell > > pkill to kill all processes of a given name, but only within our own > > session id. This /should/ suffice to run multiple fstests on the same > > machine without one instance shooting down processes of another > > instance. > > > > This fixes a general problem with using "pkill --parent" -- if the > > process being targeted is not a direct descendant of the bash script > > calling pkill, then pkill will not do anything. The scrub stress tests > > make use of multiple background subshells, which is how a ^C in the > > parent process fails to result in fsx/fsstress being killed. > > Yeah, 'pkill --parent' was the best I had managed to come up that > mostly worked, not because it perfect. That was something I wanted > feedback on before merge because it still had problems... > > > This is necessary to fix SOAK_DURATION runtime constraints for all the > > scrub stress tests. However, there is a cost -- the test program no > > longer runs with the same controlling tty as ./check, which means that > > ^Z doesn't work and SIGINT/SIGQUIT are set to SIG_IGN. IOWs, if a test > > wants to kill its subprocesses, it must use another signal such as > > SIGPIPE. Fortunately, bash doesn't whine about children dying due to > > fatal signals if the children run in a different session id. > > > > I also explored alternate designs, and this was the least unsatisfying: > > > > a) Setting the process group didn't work because background subshells > > are assigned a new group id. > > Yup, tried that. > > > b) Constraining the pkill/pgrep search to a cgroup could work, but we'd > > have to set up a cgroup in which to run the fstest. > > thought about that, too, and considered if systemd scopes could do > that, but ... > > > > > c) Putting test subprocesses in a systemd sub-scope and telling systemd > > to kill the sub-scope could work because ./check can already use it to > > ensure that all child processes of a test are killed. However, this is > > an *optional* feature, which means that we'd have to require systemd. > > ... requiring systemd was somewhat of a show-stopper for testing > older distros. Isn't RHEL7 the oldest one at this point? And it does systemd. At this point the only reason I didn't go full systemd is out of consideration for Devuan, since they probably need QA. > > d) Constraining the pkill/pgrep search to a particular mount namespace > > could work, but we already have tests that set up their own mount > > namespaces, which means the constrained pgrep will not find all child > > processes of a test. > > *nod*. > > > e) Constraining to any other type of namespace (uts, pid, etc) might not > > work because those namespaces might not be enabled. > > *nod* > > I also tried modifying fsstress to catch and propagate signals and a > couple of other ways of managing processes in the stress code, but > all ended up having significantly worse downsides than using 'pkill > --parent'. Yeah, and then you'd still have to figure out fsx and any other random little utility that a test might run in a background. > I was aware of session IDs, but I've never used them before and > hadn't gone down the rabbit hole of working out how to use them when > I posted the initial RFC patchset. <nod> Session IDs kinda suck, but they suck the least for nearly minimal extra effort. > > f) Revert check-parallel and go back to one fstests instance per system. > > Zorro already chose not to revert. > > > > So. Change _run_seq to create a the ./$seq process with a new session > > id, update _su calls to use the same session as the parent test, update > > all the pkill sites to use a wrapper so that we only target processes > > created by *this* instance of fstests, and update SIGINT to SIGPIPE. > > Yeah, that's definitely cleaner. > > ..... > > > @@ -1173,13 +1173,11 @@ _scratch_xfs_stress_scrub_cleanup() { > > rm -f "$runningfile" > > echo "Cleaning up scrub stress run at $(date)" >> $seqres.full > > > > - # Send SIGINT so that bash won't print a 'Terminated' message that > > - # distorts the golden output. > > echo "Killing stressor processes at $(date)" >> $seqres.full > > - _kill_fsstress > > - pkill -INT --parent $$ xfs_io >> $seqres.full 2>&1 > > - pkill -INT --parent $$ fsx >> $seqres.full 2>&1 > > - pkill -INT --parent $$ xfs_scrub >> $seqres.full 2>&1 > > + _pkill --echo -PIPE fsstress >> $seqres.full 2>&1 > > + _pkill --echo -PIPE xfs_io >> $seqres.full 2>&1 > > + _pkill --echo -PIPE fsx >> $seqres.full 2>&1 > > + _pkill --echo -PIPE xfs_scrub >> $seqres.full 2>&1 > > Removing _kill_fsstress is wrong - the fsstress process has already > been renamed, so open coding 'pkill fsstress' may not match. The > _kill_fsstress() code gets changed to do the right thing here: > > > @@ -69,7 +75,7 @@ _kill_fsstress() > > if [ -n "$_FSSTRESS_PID" ]; then > > # use SIGPIPE to avoid "Killed" messages from bash > > echo "killing $_FSSTRESS_BIN" >> $seqres.full > > - pkill -PIPE $_FSSTRESS_BIN >> $seqres.full 2>&1 > > + _pkill -PIPE $_FSSTRESS_BIN >> $seqres.full 2>&1 > > _wait_for_fsstress > > return $? > > fi > > Then in the next patch when the _FSSTRESS_BIN workaround goes away, > _kill_fsstress() is exactly what you open coded in > _scratch_xfs_stress_scrub_cleanup().... > > i.e. common/fuzzy really shouldn't open code the fsstress process > management - it should use the wrapper like everything else does. Ok will change. I suppose I did go fix up the setting (or not) of _FSSTRESS_PID. > Everything else in the patch looks good. Cool! --D > -Dave. > -- > Dave Chinner > david@fromorbit.com >
On Tue, Jan 21, 2025 at 08:24:00PM -0800, Darrick J. Wong wrote: > On Tue, Jan 21, 2025 at 02:28:26PM +1100, Dave Chinner wrote: > > On Thu, Jan 16, 2025 at 03:27:15PM -0800, Darrick J. Wong wrote: > > > c) Putting test subprocesses in a systemd sub-scope and telling systemd > > > to kill the sub-scope could work because ./check can already use it to > > > ensure that all child processes of a test are killed. However, this is > > > an *optional* feature, which means that we'd have to require systemd. > > > > ... requiring systemd was somewhat of a show-stopper for testing > > older distros. > > Isn't RHEL7 the oldest one at this point? And it does systemd. At this > point the only reason I didn't go full systemd is out of consideration > for Devuan, since they probably need QA. I have no idea what is out there in distro land vs what fstests "supports". All I know is that there are distros out there that don't use systemd. It feels like poor form to prevent generic filesystem QA infrastructure from running on those distros by making an avoidable choice to tie the infrastructure exclusively to systemd-based functionality.... -Dave.
On Wed, Jan 22, 2025 at 05:08:17PM +1100, Dave Chinner wrote: > On Tue, Jan 21, 2025 at 08:24:00PM -0800, Darrick J. Wong wrote: > > On Tue, Jan 21, 2025 at 02:28:26PM +1100, Dave Chinner wrote: > > > On Thu, Jan 16, 2025 at 03:27:15PM -0800, Darrick J. Wong wrote: > > > > c) Putting test subprocesses in a systemd sub-scope and telling systemd > > > > to kill the sub-scope could work because ./check can already use it to > > > > ensure that all child processes of a test are killed. However, this is > > > > an *optional* feature, which means that we'd have to require systemd. > > > > > > ... requiring systemd was somewhat of a show-stopper for testing > > > older distros. > > > > Isn't RHEL7 the oldest one at this point? And it does systemd. At this > > point the only reason I didn't go full systemd is out of consideration > > for Devuan, since they probably need QA. > > I have no idea what is out there in distro land vs what fstests > "supports". All I know is that there are distros out there that > don't use systemd. > > It feels like poor form to prevent generic filesystem QA > infrastructure from running on those distros by making an avoidable > choice to tie the infrastructure exclusively to systemd-based > functionality.... Agreed, though at some point after these bugfixes are merged I'll see if I can build on the existing "if you have systemd then ___ else here's your shabby opencoded version" logic in fstests to isolate the ./checks from each other a little better. It'd be kinda nice if we could actually just put them in something resembling a modernish container, albeit with the same underlying fs. <shrug> Anyone else interested in that? --D > -Dave. > -- > Dave Chinner > david@fromorbit.com >
On Tue, Jan 21, 2025 at 11:05:20PM -0800, Darrick J. Wong wrote: > On Wed, Jan 22, 2025 at 05:08:17PM +1100, Dave Chinner wrote: > > On Tue, Jan 21, 2025 at 08:24:00PM -0800, Darrick J. Wong wrote: > > > On Tue, Jan 21, 2025 at 02:28:26PM +1100, Dave Chinner wrote: > > > > On Thu, Jan 16, 2025 at 03:27:15PM -0800, Darrick J. Wong wrote: > > > > > c) Putting test subprocesses in a systemd sub-scope and telling systemd > > > > > to kill the sub-scope could work because ./check can already use it to > > > > > ensure that all child processes of a test are killed. However, this is > > > > > an *optional* feature, which means that we'd have to require systemd. > > > > > > > > ... requiring systemd was somewhat of a show-stopper for testing > > > > older distros. > > > > > > Isn't RHEL7 the oldest one at this point? And it does systemd. At this > > > point the only reason I didn't go full systemd is out of consideration > > > for Devuan, since they probably need QA. > > > > I have no idea what is out there in distro land vs what fstests > > "supports". All I know is that there are distros out there that > > don't use systemd. > > > > It feels like poor form to prevent generic filesystem QA > > infrastructure from running on those distros by making an avoidable > > choice to tie the infrastructure exclusively to systemd-based > > functionality.... > > Agreed, though at some point after these bugfixes are merged I'll see if > I can build on the existing "if you have systemd then ___ else here's > your shabby opencoded version" logic in fstests to isolate the ./checks > from each other a little better. It'd be kinda nice if we could > actually just put them in something resembling a modernish container, > albeit with the same underlying fs. Agreed, but I don't think we need to depend on systemd for that, either. > <shrug> Anyone else interested in that? check-parallel has already started down that road with the mount namespace isolation it uses for the runner tasks via src/nsexec.c. My plan has been to factor more of the check test running code (similar to what I did with the test list parsing) so that the check-parallel can iterate sections itself and runners can execute individual tests directly, rather than bouncing them through check to execute a set of tests serially. Then check-parallel could do whatever it needed to isolate individual tests from each other and nothing in check would need to change. Now I'm wondering if I can just run each runner's check instance in it's own private PID namespace as easily as I'm running them in their own private mount namespace... Hmmm - looks like src/nsexec.c can create new PID namespaces via the "-p" option. I haven't used that before - I wonder if that's a better solution that using per-test session IDs to solve the pkill --parent problem? Something to look into in the morning.... -Dave.
On Wed, Jan 22, 2025 at 08:42:49PM +1100, Dave Chinner wrote: > On Tue, Jan 21, 2025 at 11:05:20PM -0800, Darrick J. Wong wrote: > > On Wed, Jan 22, 2025 at 05:08:17PM +1100, Dave Chinner wrote: > > > On Tue, Jan 21, 2025 at 08:24:00PM -0800, Darrick J. Wong wrote: > > > > On Tue, Jan 21, 2025 at 02:28:26PM +1100, Dave Chinner wrote: > > > > > On Thu, Jan 16, 2025 at 03:27:15PM -0800, Darrick J. Wong wrote: > > > > > > c) Putting test subprocesses in a systemd sub-scope and telling systemd > > > > > > to kill the sub-scope could work because ./check can already use it to > > > > > > ensure that all child processes of a test are killed. However, this is > > > > > > an *optional* feature, which means that we'd have to require systemd. > > > > > > > > > > ... requiring systemd was somewhat of a show-stopper for testing > > > > > older distros. > > > > > > > > Isn't RHEL7 the oldest one at this point? And it does systemd. At this > > > > point the only reason I didn't go full systemd is out of consideration > > > > for Devuan, since they probably need QA. > > > > > > I have no idea what is out there in distro land vs what fstests > > > "supports". All I know is that there are distros out there that > > > don't use systemd. > > > > > > It feels like poor form to prevent generic filesystem QA > > > infrastructure from running on those distros by making an avoidable > > > choice to tie the infrastructure exclusively to systemd-based > > > functionality.... > > > > Agreed, though at some point after these bugfixes are merged I'll see if > > I can build on the existing "if you have systemd then ___ else here's > > your shabby opencoded version" logic in fstests to isolate the ./checks > > from each other a little better. It'd be kinda nice if we could > > actually just put them in something resembling a modernish container, > > albeit with the same underlying fs. > > Agreed, but I don't think we need to depend on systemd for that, > either. > > > <shrug> Anyone else interested in that? > > check-parallel has already started down that road with the > mount namespace isolation it uses for the runner tasks via > src/nsexec.c. > > My plan has been to factor more of the check test running code > (similar to what I did with the test list parsing) so that the > check-parallel can iterate sections itself and runners can execute > individual tests directly, rather than bouncing them through check > to execute a set of tests serially. Then check-parallel could do > whatever it needed to isolate individual tests from each other and > nothing in check would need to change. > > Now I'm wondering if I can just run each runner's check instance > in it's own private PID namespace as easily as I'm running them in > their own private mount namespace... > > Hmmm - looks like src/nsexec.c can create new PID namespaces via > the "-p" option. I haven't used that before - I wonder if that's a > better solution that using per-test session IDs to solve the pkill > --parent problem? Something to look into in the morning.... I tried that -- it appears to work, but then: # ./src/nsexec -p bash Current time: Wed Jan 22 13:43:33 PST 2025; Terminal: /dev/pts/0 # ps fatal library error, lookup self # Regular processes and pkill/pgrep seem to work, but that didn't seem especially encouraging so I stopped. :/ --D > -Dave. > > -- > Dave Chinner > david@fromorbit.com >
On Wed, Jan 22, 2025 at 01:46:09PM -0800, Darrick J. Wong wrote: > On Wed, Jan 22, 2025 at 08:42:49PM +1100, Dave Chinner wrote: > > On Tue, Jan 21, 2025 at 11:05:20PM -0800, Darrick J. Wong wrote: > > > On Wed, Jan 22, 2025 at 05:08:17PM +1100, Dave Chinner wrote: > > > > On Tue, Jan 21, 2025 at 08:24:00PM -0800, Darrick J. Wong wrote: > > > > > On Tue, Jan 21, 2025 at 02:28:26PM +1100, Dave Chinner wrote: > > > > > > On Thu, Jan 16, 2025 at 03:27:15PM -0800, Darrick J. Wong wrote: > > > > > > > c) Putting test subprocesses in a systemd sub-scope and telling systemd > > > > > > > to kill the sub-scope could work because ./check can already use it to > > > > > > > ensure that all child processes of a test are killed. However, this is > > > > > > > an *optional* feature, which means that we'd have to require systemd. > > > > > > > > > > > > ... requiring systemd was somewhat of a show-stopper for testing > > > > > > older distros. > > > > > > > > > > Isn't RHEL7 the oldest one at this point? And it does systemd. At this > > > > > point the only reason I didn't go full systemd is out of consideration > > > > > for Devuan, since they probably need QA. > > > > > > > > I have no idea what is out there in distro land vs what fstests > > > > "supports". All I know is that there are distros out there that > > > > don't use systemd. > > > > > > > > It feels like poor form to prevent generic filesystem QA > > > > infrastructure from running on those distros by making an avoidable > > > > choice to tie the infrastructure exclusively to systemd-based > > > > functionality.... > > > > > > Agreed, though at some point after these bugfixes are merged I'll see if > > > I can build on the existing "if you have systemd then ___ else here's > > > your shabby opencoded version" logic in fstests to isolate the ./checks > > > from each other a little better. It'd be kinda nice if we could > > > actually just put them in something resembling a modernish container, > > > albeit with the same underlying fs. > > > > Agreed, but I don't think we need to depend on systemd for that, > > either. > > > > > <shrug> Anyone else interested in that? > > > > check-parallel has already started down that road with the > > mount namespace isolation it uses for the runner tasks via > > src/nsexec.c. > > > > My plan has been to factor more of the check test running code > > (similar to what I did with the test list parsing) so that the > > check-parallel can iterate sections itself and runners can execute > > individual tests directly, rather than bouncing them through check > > to execute a set of tests serially. Then check-parallel could do > > whatever it needed to isolate individual tests from each other and > > nothing in check would need to change. > > > > Now I'm wondering if I can just run each runner's check instance > > in it's own private PID namespace as easily as I'm running them in > > their own private mount namespace... > > > > Hmmm - looks like src/nsexec.c can create new PID namespaces via > > the "-p" option. I haven't used that before - I wonder if that's a > > better solution that using per-test session IDs to solve the pkill > > --parent problem? Something to look into in the morning.... > > I tried that -- it appears to work, but then: > > # ./src/nsexec -p bash > Current time: Wed Jan 22 13:43:33 PST 2025; Terminal: /dev/pts/0 > # ps > fatal library error, lookup self > # That looks like a bug in whatever distro you are using - it works as it should here on a recent debian unstable userspace. Note, however, that ps will show all processes in both the parent and the child namespace the shell is running on because the contents of /proc are the same for both. However, because we are also using private mount namespaces for the check process, pid_namespaces(7) tells us: /proc and PID namespaces A /proc filesystem shows (in the /proc/pid directories) only processes visible in the PID namespace of the process that performed the mount, even if the /proc filesystem is viewed from processes in other namespaces. After creating a new PID namespace, it is useful for the child to change its root directory and mount a new procfs instance at /proc so that tools such as ps(1) work >>> correctly. If a new mount namespace is simultaneously >>> created by including CLONE_NEWNS in the flags argument of >>> clone(2) or unshare(2), then it isn't necessary to change the >>> root directory: a new procfs instance can be mounted directly >>> over /proc. From a shell, the command to mount /proc is: $ mount -t proc proc /proc Calling readlink(2) on the path /proc/self yields the process ID of the caller in the PID namespace of the procfs mount (i.e., the PID name‐ space of the process that mounted the procfs). This can be useful for introspection purposes, when a process wants to discover its PID in other namespaces. This appears to give us an environment that only shows the processes within the current PID namespace: $ sudo src/nsexec -p -m bash # mount -t proc proc /proc # ps waux USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 1 0.0 0.0 7384 3744 pts/1 S 11:55 0:00 bash root 72 0.0 0.0 8300 3736 pts/1 R+ 12:04 0:00 ps waux # pstree -N pid [4026538173] bash───pstree # Yup, there we go - we have full PID isolation for this shell. OK, I suspect this is a better way to proceed for check-parallel than to need session IDs for individual tests and wrappers for pgrep/pkill/pidwait. Let me see what breaks when I use this..... -Dave.
On Thu, Jan 23, 2025 at 12:16:50PM +1100, Dave Chinner wrote: > On Wed, Jan 22, 2025 at 01:46:09PM -0800, Darrick J. Wong wrote: > > > > Agreed, though at some point after these bugfixes are merged I'll see if > > > > I can build on the existing "if you have systemd then ___ else here's > > > > your shabby opencoded version" logic in fstests to isolate the ./checks > > > > from each other a little better. It'd be kinda nice if we could > > > > actually just put them in something resembling a modernish container, > > > > albeit with the same underlying fs. > > > > > > Agreed, but I don't think we need to depend on systemd for that, > > > either. > > > > > > > <shrug> Anyone else interested in that? > > > > > > check-parallel has already started down that road with the > > > mount namespace isolation it uses for the runner tasks via > > > src/nsexec.c. ..... > > > Hmmm - looks like src/nsexec.c can create new PID namespaces via > > > the "-p" option. I haven't used that before - I wonder if that's a > > > better solution that using per-test session IDs to solve the pkill > > > --parent problem? Something to look into in the morning.... ..... > Note, however, that ps will show all processes in both the parent > and the child namespace the shell is running on because the contents > of /proc are the same for both. > > However, because we are also using private mount namespaces for the > check process, pid_namespaces(7) tells us: > > /proc and PID namespaces > > A /proc filesystem shows (in the /proc/pid directories) only > processes visible in the PID namespace of the process that > performed the mount, even if the /proc filesystem is viewed > from processes in other namespaces. > > After creating a new PID namespace, it is useful for the > child to change its root directory and mount a new procfs > instance at /proc so that tools such as ps(1) work > >>> correctly. If a new mount namespace is simultaneously > >>> created by including CLONE_NEWNS in the flags argument of > >>> clone(2) or unshare(2), then it isn't necessary to change the > >>> root directory: a new procfs instance can be mounted directly > >>> over /proc. > > From a shell, the command to mount /proc is: > > $ mount -t proc proc /proc > > Calling readlink(2) on the path /proc/self yields the process > ID of the caller in the PID namespace of the procfs mount > (i.e., the PID name‐ space of the process that mounted the > procfs). This can be useful for introspection purposes, when > a process wants to discover its PID in other namespaces. > > This appears to give us an environment that only shows the processes > within the current PID namespace: > > $ sudo src/nsexec -p -m bash > # mount -t proc proc /proc > # ps waux > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND > root 1 0.0 0.0 7384 3744 pts/1 S 11:55 0:00 bash > root 72 0.0 0.0 8300 3736 pts/1 R+ 12:04 0:00 ps waux > # pstree -N pid > [4026538173] > bash───pstree > # > > Yup, there we go - we have full PID isolation for this shell. Ok, it took me some time to get this to work reliably - the way was full of landmines with little documentation to guide around them. 1. If multiple commands are needed to be run from nsexec, a helper script is needed (I called it check-helper). 2. You have to `mount --make-private /proc` before doing anything to make the mounts of /proc inside the pid namespace work correctly. If you don't do this, every other mount namespace will also see only the newest mount, and that means every runner but the last one to start with the wrong mount and PID namespace view in /proc. 3. if you get the system into the state described by 1), unmounting /proc in each runner then races to remove the top /proc mount. Many threads get -EBUSY failures from unmount, leaving many stale mounts of /proc behind. 4. the top level shell where check-parallel was invoked is left with the view where /proc has been unmounted from a PID/mount namespace that has gone away. Hence /proc now has no processes or mounts being reported and nothing works until you mount a new instance /proc in that namespace. 5. After mounting proc again there are lots of mounts of stale /proc mounts still reported. They cannot be unmounted as the mount namespaces that created them have gone away and unmounting /proc in the current shell simply removes the last mounted one and we goto 4). 4. /tmp is still shared across all runner instances so all the concurrent runners dump all their tmp files in /tmp. However, the runners no longer have unique PIDs (i.e. check runs as PID 3 in all runner instaces). This means using /tmp/tmp.$$ as the check/test temp file definition results is instant tmp file name collisions and random things in check and tests fail. check and common/preamble have to be converted to use `mktemp` to provide unique tempfile name prefixes again. 5. Don't forget to revert the parent /proc mount back to shared after check has finished running (or was aborted). I think with this (current prototype patch below), we can use PID namespaces rather than process session IDs for check-parallel safe process management. Thoughts? -Dave.
On Tue, Jan 28, 2025 at 03:34:50PM +1100, Dave Chinner wrote: > On Thu, Jan 23, 2025 at 12:16:50PM +1100, Dave Chinner wrote: > > On Wed, Jan 22, 2025 at 01:46:09PM -0800, Darrick J. Wong wrote: > > > > > Agreed, though at some point after these bugfixes are merged I'll see if > > > > > I can build on the existing "if you have systemd then ___ else here's > > > > > your shabby opencoded version" logic in fstests to isolate the ./checks > > > > > from each other a little better. It'd be kinda nice if we could > > > > > actually just put them in something resembling a modernish container, > > > > > albeit with the same underlying fs. > > > > > > > > Agreed, but I don't think we need to depend on systemd for that, > > > > either. > > > > > > > > > <shrug> Anyone else interested in that? > > > > > > > > check-parallel has already started down that road with the > > > > mount namespace isolation it uses for the runner tasks via > > > > src/nsexec.c. > ..... > > > > Hmmm - looks like src/nsexec.c can create new PID namespaces via > > > > the "-p" option. I haven't used that before - I wonder if that's a > > > > better solution that using per-test session IDs to solve the pkill > > > > --parent problem? Something to look into in the morning.... > > ..... > > > Note, however, that ps will show all processes in both the parent > > and the child namespace the shell is running on because the contents > > of /proc are the same for both. > > > > However, because we are also using private mount namespaces for the > > check process, pid_namespaces(7) tells us: > > > > /proc and PID namespaces > > > > A /proc filesystem shows (in the /proc/pid directories) only > > processes visible in the PID namespace of the process that > > performed the mount, even if the /proc filesystem is viewed > > from processes in other namespaces. > > > > After creating a new PID namespace, it is useful for the > > child to change its root directory and mount a new procfs > > instance at /proc so that tools such as ps(1) work > > >>> correctly. If a new mount namespace is simultaneously > > >>> created by including CLONE_NEWNS in the flags argument of > > >>> clone(2) or unshare(2), then it isn't necessary to change the > > >>> root directory: a new procfs instance can be mounted directly > > >>> over /proc. > > > > From a shell, the command to mount /proc is: > > > > $ mount -t proc proc /proc > > > > Calling readlink(2) on the path /proc/self yields the process > > ID of the caller in the PID namespace of the procfs mount > > (i.e., the PID name‐ space of the process that mounted the > > procfs). This can be useful for introspection purposes, when > > a process wants to discover its PID in other namespaces. > > > > This appears to give us an environment that only shows the processes > > within the current PID namespace: > > > > $ sudo src/nsexec -p -m bash > > # mount -t proc proc /proc > > # ps waux > > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND > > root 1 0.0 0.0 7384 3744 pts/1 S 11:55 0:00 bash > > root 72 0.0 0.0 8300 3736 pts/1 R+ 12:04 0:00 ps waux > > # pstree -N pid > > [4026538173] > > bash───pstree > > # > > > > Yup, there we go - we have full PID isolation for this shell. > > Ok, it took me some time to get this to work reliably - the way was > full of landmines with little documentation to guide around them. > > 1. If multiple commands are needed to be run from nsexec, a helper > script is needed (I called it check-helper). > > 2. You have to `mount --make-private /proc` before doing anything to > make the mounts of /proc inside the pid namespace work correctly. > If you don't do this, every other mount namespace will also see > only the newest mount, and that means every runner but the last > one to start with the wrong mount and PID namespace view in /proc. > > 3. if you get the system into the state described by 1), unmounting > /proc in each runner then races to remove the top /proc mount. > Many threads get -EBUSY failures from unmount, leaving many stale > mounts of /proc behind. > > 4. the top level shell where check-parallel was invoked is left with > the view where /proc has been unmounted from a PID/mount > namespace that has gone away. Hence /proc now has no processes or > mounts being reported and nothing works until you mount a new > instance /proc in that namespace. > > 5. After mounting proc again there are lots of mounts of stale /proc > mounts still reported. They cannot be unmounted as the mount > namespaces that created them have gone away and unmounting /proc > in the current shell simply removes the last mounted one and we > goto 4). > > 4. /tmp is still shared across all runner instances so all the > > concurrent runners dump all their tmp files in /tmp. However, the > runners no longer have unique PIDs (i.e. check runs as PID 3 in > all runner instaces). This means using /tmp/tmp.$$ as the > check/test temp file definition results is instant tmp file name > collisions and random things in check and tests fail. check and > common/preamble have to be converted to use `mktemp` to provide > unique tempfile name prefixes again. > > 5. Don't forget to revert the parent /proc mount back to shared > after check has finished running (or was aborted). > > I think with this (current prototype patch below), we can use PID > namespaces rather than process session IDs for check-parallel safe > process management. > > Thoughts? Was about to go to bed, but can we also start a new mount namespace, create a private (or at least non-global) /tmp to put files into, and then each test instance is isolated from clobbering the /tmpfiles of other ./check instances *and* the rest of the system? --D > -Dave. > -- > Dave Chinner > david@fromorbit.com > > > check-parallel: use PID namespaces for runner process isolation > > From: Dave Chinner <dchinner@redhat.com> > > This provides isolation between individual runners so that they > cannot see the processes that other test runners have created. > This means tools like pkill will only find processes run by the test > that is calling it, hence there is no danger taht it might kill > processes owned by a different test in a different runner context. > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- > check | 6 +++++- > check-helper | 23 +++++++++++++++++++++++ > check-parallel | 25 +++++++++++++++++++------ > common/preamble | 5 ++++- > 4 files changed, 51 insertions(+), 8 deletions(-) > > diff --git a/check b/check > index 4dc266dcf..314436667 100755 > --- a/check > +++ b/check > @@ -4,7 +4,11 @@ > # > # Control script for QA > # > -tmp=/tmp/$$ > + > +# When run from a pid namespace, /tmp/tmp.$$ is not a unique identifier. > +# Use mktemp instead. > +tmp=`mktemp` > + > status=0 > needwrap=true > needsum=true > diff --git a/check-helper b/check-helper > new file mode 100755 > index 000000000..47a92de8b > --- /dev/null > +++ b/check-helper > @@ -0,0 +1,23 @@ > +#!/bin/bash > +# SPDX-License-Identifier: GPL-2.0 > +# Copyright (c) 2025 Red Hat, Inc. All Rights Reserved. > +# > + > +# When check is run from check-parallel, it is run in private mount and PID > +# namespaces. We need to set up /proc to only show processes in this PID > +# namespace, so we mount a new instance of /proc over the top of the existing > +# version. Because we are in a private mount namespace, the does not propagate > +# outside this context and hence does not conflict with the other test runners > +# that are performing the same setup actions. > + > +args="$@" > + > +#echo $args > +mount -t proc proc /proc > +ret=$? > +if [ $ret -eq 0 ]; then > + ./check $args > + umount -l /proc > +else > + echo failed to mount /proc, ret $ret! > +fi > diff --git a/check-parallel b/check-parallel > index 2fbf0fdbe..6082baf5e 100755 > --- a/check-parallel > +++ b/check-parallel > @@ -259,14 +259,14 @@ runner_go() > rm -f $RESULT_BASE/check.* > > # Only supports default mkfs parameters right now > - wipefs -a $TEST_DEV > /dev/null 2>&1 > - yes | mkfs -t $FSTYP $TEST_MKFS_OPTS $TEST_DEV > /dev/null 2>&1 > + wipefs -a $TEST_DEV > $me/log 2>&1 > + yes | mkfs -t $FSTYP $TEST_MKFS_OPTS $TEST_DEV >> $me/log 2>&1 > > # export DUMP_CORRUPT_FS=1 > > - # Run the tests in it's own mount namespace, as per the comment below > - # that precedes making the basedir a private mount. > - ./src/nsexec -m ./check $run_section -x unreliable_in_parallel --exact-order ${runner_list[$id]} > $me/log 2>&1 > + # Run the tests in it's own mount and PID namespace, as per the comment > + # below that precedes making the basedir a private mount. > + ./src/nsexec -m -p ./check-helper $run_section -x unreliable_in_parallel --exact-order ${runner_list[$id]} >> $me/log 2>&1 > > wait > sleep 1 > @@ -291,6 +291,8 @@ cleanup() > umount -R $basedir/*/test 2> /dev/null > umount -R $basedir/*/scratch 2> /dev/null > losetup --detach-all > + mount --make-shared /proc > + mount --make-shared $basedir > } > > trap "cleanup; exit" HUP INT QUIT TERM > @@ -311,10 +313,17 @@ fi > # controls the mount to succeed without actually unmounting the filesytsem > # because a mount namespace still holds a reference to it. This causes other > # operations on the block device to fail as it is still busy (e.g. fsck, mkfs, > -# etc). Hence we make the basedir private here and then run each check instance > +# etc). > +# > +# Hence we make the basedir private here and then run each check instance > # in it's own mount namespace so that they cannot see mounts that other tests > # are performing. > +# > +# We also need to make /proc private so that the runners can be run cleanly in > +# a PID namespace. This requires an new mount of /proc inside the PID namespace, > +# and this requires a private mount namespace to work correctly. > mount --make-private $basedir > +mount --make-private /proc > > now=`date +%Y-%m-%d-%H:%M:%S` > for ((i = 0; i < $runners; i++)); do > @@ -328,6 +337,10 @@ for ((i = 0; i < $runners; i++)); do > done; > wait > > +# Restore the parents back to shared mount namespace behaviour. > +mount --make-shared /proc > +mount --make-shared $basedir > + > if [ -n "$show_test_list" ]; then > exit 0 > fi > diff --git a/common/preamble b/common/preamble > index 78e45d522..8f47b172a 100644 > --- a/common/preamble > +++ b/common/preamble > @@ -43,8 +43,11 @@ _begin_fstest() > seqres=$RESULT_DIR/$seq > echo "QA output created by $seq" > > + # When run from a pid namespace, /tmp/tmp.$$ is not a unique identifier. > + # Use mktemp instead. > + tmp=`mktemp` > + > here=`pwd` > - tmp=/tmp/$$ > status=1 # failure is the default! > > _register_cleanup _cleanup
On Mon, Jan 27, 2025 at 11:23:52PM -0800, Darrick J. Wong wrote: > On Tue, Jan 28, 2025 at 03:34:50PM +1100, Dave Chinner wrote: > > On Thu, Jan 23, 2025 at 12:16:50PM +1100, Dave Chinner wrote: > > 4. /tmp is still shared across all runner instances so all the > > > > concurrent runners dump all their tmp files in /tmp. However, the > > runners no longer have unique PIDs (i.e. check runs as PID 3 in > > all runner instaces). This means using /tmp/tmp.$$ as the > > check/test temp file definition results is instant tmp file name > > collisions and random things in check and tests fail. check and > > common/preamble have to be converted to use `mktemp` to provide > > unique tempfile name prefixes again. > > > > 5. Don't forget to revert the parent /proc mount back to shared > > after check has finished running (or was aborted). > > > > I think with this (current prototype patch below), we can use PID > > namespaces rather than process session IDs for check-parallel safe > > process management. > > > > Thoughts? > > Was about to go to bed, but can we also start a new mount namespace, > create a private (or at least non-global) /tmp to put files into, and > then each test instance is isolated from clobbering the /tmpfiles of > other ./check instances *and* the rest of the system? We probably can. I didn't want to go down that rat hole straight away, because then I'd have to make a decision about what to mount there. One thing at a time.... I suspect that I can just use a tmpfs filesystem for it - there's heaps of memory available on my test machines and we don't use /tmp to hold large files, so that should work fine for me. However, I'm a little concerned about what will happen when testing under memory pressure situations if /tmp needs memory to operate correctly. I'll have a look at what is needed for private tmpfs /tmp instances to work - it should work just fine. However, if check-parallel has taught me anything, it is that trying to use "should work" features on a modern system tends to mean "this is a poorly documented rat-hole that with many dead-ends that will be explored before a working solution is found"... -Dave.
On Wed, Jan 29, 2025 at 07:39:22AM +1100, Dave Chinner wrote: > On Mon, Jan 27, 2025 at 11:23:52PM -0800, Darrick J. Wong wrote: > > On Tue, Jan 28, 2025 at 03:34:50PM +1100, Dave Chinner wrote: > > > On Thu, Jan 23, 2025 at 12:16:50PM +1100, Dave Chinner wrote: > > > 4. /tmp is still shared across all runner instances so all the > > > > > > concurrent runners dump all their tmp files in /tmp. However, the > > > runners no longer have unique PIDs (i.e. check runs as PID 3 in > > > all runner instaces). This means using /tmp/tmp.$$ as the > > > check/test temp file definition results is instant tmp file name > > > collisions and random things in check and tests fail. check and > > > common/preamble have to be converted to use `mktemp` to provide > > > unique tempfile name prefixes again. > > > > > > 5. Don't forget to revert the parent /proc mount back to shared > > > after check has finished running (or was aborted). > > > > > > I think with this (current prototype patch below), we can use PID > > > namespaces rather than process session IDs for check-parallel safe > > > process management. > > > > > > Thoughts? > > > > Was about to go to bed, but can we also start a new mount namespace, > > create a private (or at least non-global) /tmp to put files into, and > > then each test instance is isolated from clobbering the /tmpfiles of > > other ./check instances *and* the rest of the system? > > We probably can. I didn't want to go down that rat hole straight > away, because then I'd have to make a decision about what to mount > there. One thing at a time.... > > I suspect that I can just use a tmpfs filesystem for it - there's > heaps of memory available on my test machines and we don't use /tmp > to hold large files, so that should work fine for me. However, I'm > a little concerned about what will happen when testing under memory > pressure situations if /tmp needs memory to operate correctly. > > I'll have a look at what is needed for private tmpfs /tmp instances > to work - it should work just fine. > > However, if check-parallel has taught me anything, it is that trying > to use "should work" features on a modern system tends to mean "this > is a poorly documented rat-hole that with many dead-ends that will > be explored before a working solution is found"... <nod> I'm running an experiment overnight with the following patch to get rid of the session id grossness. AFAICT it can also be used by check-parallel to start its subprocesses in separate pid namespaces, though I didn't investigate thoroughly. I'm also not sure it's required for check-helper to unmount the /proc that it creates; merely exiting seems to clean everything up? <shrug> I also tried using systemd-nspawn to run fstests inside a "container" but very quickly discovered that you can't pass block devices to the container which makes fstests pretty useless for testing real scsi devices. :/ --D check: run tests in a private pid/mount namespace Experiment with running tests in a private pid/mount namespace for better isolation of the scheduler (check) vs the test (./$seq). This also makes it cleaner to adjust the oom score and is a convenient place to set up the environment before invoking the test. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> --- check | 30 +++++------------------------- check-helper | 31 +++++++++++++++++++++++++++++++ common/rc | 8 +++----- 3 files changed, 39 insertions(+), 30 deletions(-) create mode 100755 check-helper diff --git a/check b/check index 00ee5af2a31e34..9c70f6f1e10110 100755 --- a/check +++ b/check @@ -698,41 +698,21 @@ _adjust_oom_score -500 # systemd doesn't automatically remove transient scopes that fail to terminate # when systemd tells them to terminate (e.g. programs stuck in D state when # systemd sends SIGKILL), so we use reset-failed to tear down the scope. -# -# Use setsid to run the test program with a separate session id so that we -# can pkill only the processes started by this test. _run_seq() { - local cmd=(bash -c "test -w ${OOM_SCORE_ADJ} && echo 250 > ${OOM_SCORE_ADJ}; exec setsid bash ./$seq") + local cmd=("./check-helper" "./$seq") if [ -n "${HAVE_SYSTEMD_SCOPES}" ]; then local unit="$(systemd-escape "fs$seq").scope" systemctl reset-failed "${unit}" &> /dev/null - systemd-run --quiet --unit "${unit}" --scope "${cmd[@]}" & - CHILDPID=$! - wait + systemd-run --quiet --unit "${unit}" --scope "${cmd[@]}" res=$? - unset CHILDPID systemctl stop "${unit}" &> /dev/null return "${res}" else # bash won't run the SIGINT trap handler while there are # foreground children in a separate session, so we must run # the test in the background and wait for it. - "${cmd[@]}" & - CHILDPID=$! - wait - unset CHILDPID - fi -} - -_kill_seq() { - if [ -n "$CHILDPID" ]; then - # SIGPIPE will kill all the children (including fsstress) - # without bash logging fatal signal termination messages to the - # console - pkill -PIPE --session "$CHILDPID" - wait - unset CHILDPID + "${cmd[@]}" fi } @@ -741,9 +721,9 @@ _prepare_test_list fstests_start_time="$(date +"%F %T")" if $OPTIONS_HAVE_SECTIONS; then - trap "_kill_seq; _summary; exit \$status" 0 1 2 3 15 + trap "_summary; exit \$status" 0 1 2 3 15 else - trap "_kill_seq; _wrapup; exit \$status" 0 1 2 3 15 + trap "_wrapup; exit \$status" 0 1 2 3 15 fi function run_section() diff --git a/check-helper b/check-helper new file mode 100755 index 00000000000000..2cc8dbe5cfc791 --- /dev/null +++ b/check-helper @@ -0,0 +1,31 @@ +#!/bin/bash + +# SPDX-License-Identifier: GPL-2.0 +# Copyright (c) 2025 Oracle. All Rights Reserved. +# +# Try starting things in a private pid/mount namespace with a private /tmp +# and /proc so that child process trees cannot interfere with each other. + +if [ -n "${IN_NSEXEC}" ]; then + for path in /proc /tmp; do + mount --make-private "$path" + done + unset IN_NSEXEC + mount -t proc proc /proc + mount -t tmpfs tmpfs /tmp + + oom_knob="/proc/self/oom_score_adj" + test -w "${oom_knob}" && echo 250 > "${oom_knob}" + + # Run the test, but don't let it be pid 1 because that will confuse + # the filter functions in common/dump. + "$@" + exit +fi + +if [ -z "$1" ] || [ "$1" = "--help" ]; then + echo "Usage: $0 command [args...]" + exit 1 +fi + +IN_NSEXEC=1 exec "$(dirname "$0")/src/nsexec" -m -p $0 "$@" diff --git a/common/rc b/common/rc index 1c28a2d190f5a0..cc080ecaa9f801 100644 --- a/common/rc +++ b/common/rc @@ -33,13 +33,13 @@ _test_sync() # Kill only the test processes started by this test _pkill() { - pkill --session 0 "$@" + pkill "$@" } # Find only the test processes started by this test _pgrep() { - pgrep --session 0 "$@" + pgrep "$@" } # Common execution handling for fsstress invocation. @@ -2817,11 +2817,9 @@ _require_user_exists() [ "$?" == "0" ] || _notrun "$user user not defined." } -# Run all non-root processes in the same session as the root. Believe it or -# not, passing $SHELL in this manner works both for "su" and "su -c cmd". _su() { - su --session-command $SHELL "$@" + su "$@" } # check if a user exists and is able to execute commands.
On Tue, Jan 28, 2025 at 07:13:13PM -0800, Darrick J. Wong wrote: > On Wed, Jan 29, 2025 at 07:39:22AM +1100, Dave Chinner wrote: > > On Mon, Jan 27, 2025 at 11:23:52PM -0800, Darrick J. Wong wrote: > > > On Tue, Jan 28, 2025 at 03:34:50PM +1100, Dave Chinner wrote: > > > > On Thu, Jan 23, 2025 at 12:16:50PM +1100, Dave Chinner wrote: > > > > 4. /tmp is still shared across all runner instances so all the > > > > > > > > concurrent runners dump all their tmp files in /tmp. However, the > > > > runners no longer have unique PIDs (i.e. check runs as PID 3 in > > > > all runner instaces). This means using /tmp/tmp.$$ as the > > > > check/test temp file definition results is instant tmp file name > > > > collisions and random things in check and tests fail. check and > > > > common/preamble have to be converted to use `mktemp` to provide > > > > unique tempfile name prefixes again. > > > > > > > > 5. Don't forget to revert the parent /proc mount back to shared > > > > after check has finished running (or was aborted). > > > > > > > > I think with this (current prototype patch below), we can use PID > > > > namespaces rather than process session IDs for check-parallel safe > > > > process management. > > > > > > > > Thoughts? > > > > > > Was about to go to bed, but can we also start a new mount namespace, > > > create a private (or at least non-global) /tmp to put files into, and > > > then each test instance is isolated from clobbering the /tmpfiles of > > > other ./check instances *and* the rest of the system? > > > > We probably can. I didn't want to go down that rat hole straight > > away, because then I'd have to make a decision about what to mount > > there. One thing at a time.... > > > > I suspect that I can just use a tmpfs filesystem for it - there's > > heaps of memory available on my test machines and we don't use /tmp > > to hold large files, so that should work fine for me. However, I'm > > a little concerned about what will happen when testing under memory > > pressure situations if /tmp needs memory to operate correctly. > > > > I'll have a look at what is needed for private tmpfs /tmp instances > > to work - it should work just fine. > > > > However, if check-parallel has taught me anything, it is that trying > > to use "should work" features on a modern system tends to mean "this > > is a poorly documented rat-hole that with many dead-ends that will > > be explored before a working solution is found"... > > <nod> I'm running an experiment overnight with the following patch to > get rid of the session id grossness. AFAICT it can also be used by > check-parallel to start its subprocesses in separate pid namespaces, > though I didn't investigate thoroughly. I don't think check-parallel needs to start each check instance in it's own PID namespace - it's the tests themselves that need the isolation from each other. However, the check instances require a private mount namespace, as they mount and unmount test/scratch devices themselves and we do not want other check instances seeing those mounts. Hence I think the current check-parallel code doing mount namespace isolation as it already does should work with this patch enabling per-test process isolation inside check itself. > I'm also not sure it's required for check-helper to unmount the /proc > that it creates; merely exiting seems to clean everything up? <shrug> Yeah, I think tearing down the mount namespace (i.e. exiting the process that nsexec created) drops the last active reference to the mounts inside the private namespace and so it gets torn down that way. So from my perspective, I think your check-helper namespace patch is a good improvement and I'll build/fix anything I come across on top of it. Once your series of fixes goes in, I'll rebase all the stuff I've got on top it and go from there... > I also tried using systemd-nspawn to run fstests inside a "container" > but very quickly discovered that you can't pass block devices to the > container which makes fstests pretty useless for testing real scsi > devices. :/ Yet another dead-end in the poorly sign-posted rat-hole, eh? -Dave.
On Wed, Jan 29, 2025 at 05:06:52PM +1100, Dave Chinner wrote: > On Tue, Jan 28, 2025 at 07:13:13PM -0800, Darrick J. Wong wrote: > > On Wed, Jan 29, 2025 at 07:39:22AM +1100, Dave Chinner wrote: > > > On Mon, Jan 27, 2025 at 11:23:52PM -0800, Darrick J. Wong wrote: > > > > On Tue, Jan 28, 2025 at 03:34:50PM +1100, Dave Chinner wrote: > > > > > On Thu, Jan 23, 2025 at 12:16:50PM +1100, Dave Chinner wrote: > > > > > 4. /tmp is still shared across all runner instances so all the > > > > > > > > > > concurrent runners dump all their tmp files in /tmp. However, the > > > > > runners no longer have unique PIDs (i.e. check runs as PID 3 in > > > > > all runner instaces). This means using /tmp/tmp.$$ as the > > > > > check/test temp file definition results is instant tmp file name > > > > > collisions and random things in check and tests fail. check and > > > > > common/preamble have to be converted to use `mktemp` to provide > > > > > unique tempfile name prefixes again. > > > > > > > > > > 5. Don't forget to revert the parent /proc mount back to shared > > > > > after check has finished running (or was aborted). > > > > > > > > > > I think with this (current prototype patch below), we can use PID > > > > > namespaces rather than process session IDs for check-parallel safe > > > > > process management. > > > > > > > > > > Thoughts? > > > > > > > > Was about to go to bed, but can we also start a new mount namespace, > > > > create a private (or at least non-global) /tmp to put files into, and > > > > then each test instance is isolated from clobbering the /tmpfiles of > > > > other ./check instances *and* the rest of the system? > > > > > > We probably can. I didn't want to go down that rat hole straight > > > away, because then I'd have to make a decision about what to mount > > > there. One thing at a time.... > > > > > > I suspect that I can just use a tmpfs filesystem for it - there's > > > heaps of memory available on my test machines and we don't use /tmp > > > to hold large files, so that should work fine for me. However, I'm > > > a little concerned about what will happen when testing under memory > > > pressure situations if /tmp needs memory to operate correctly. > > > > > > I'll have a look at what is needed for private tmpfs /tmp instances > > > to work - it should work just fine. > > > > > > However, if check-parallel has taught me anything, it is that trying > > > to use "should work" features on a modern system tends to mean "this > > > is a poorly documented rat-hole that with many dead-ends that will > > > be explored before a working solution is found"... > > > > <nod> I'm running an experiment overnight with the following patch to > > get rid of the session id grossness. AFAICT it can also be used by > > check-parallel to start its subprocesses in separate pid namespaces, > > though I didn't investigate thoroughly. > > I don't think check-parallel needs to start each check instance in > it's own PID namespace - it's the tests themselves that need the > isolation from each other. > > However, the check instances require a private mount namespace, as > they mount and unmount test/scratch devices themselves and we do not > want other check instances seeing those mounts. > > Hence I think the current check-parallel code doing mount namespace > isolation as it already does should work with this patch enabling > per-test process isolation inside check itself. > > > I'm also not sure it's required for check-helper to unmount the /proc > > that it creates; merely exiting seems to clean everything up? <shrug> > > Yeah, I think tearing down the mount namespace (i.e. exiting the > process that nsexec created) drops the last active reference to the > mounts inside the private namespace and so it gets torn down that > way. > > So from my perspective, I think your check-helper namespace patch is > a good improvement and I'll build/fix anything I come across on top > of it. Once your series of fixes goes in, I'll rebase all the stuff > I've got on top it and go from there... <nod> I might reformulate the pkill code to use nsexec (and not systemd) if it's available; systemd scopes if those are available (I figured out how to get systemd to tell me the cgroup name); or worst case fall back to process sessions if neither are available. I don't know how ancient of a userspace we realistically have to support since (afaict) namespaces and systemd both showed up around the 2.6.24 era? But I also don't know if Devuan at least does pid/mount namespaces. --D > > I also tried using systemd-nspawn to run fstests inside a "container" > > but very quickly discovered that you can't pass block devices to the > > container which makes fstests pretty useless for testing real scsi > > devices. :/ > > Yet another dead-end in the poorly sign-posted rat-hole, eh? Yup. --D > -Dave. > -- > Dave Chinner > david@fromorbit.com
diff --git a/check b/check index 607d2456e6a1fe..bafe48bf12db67 100755 --- a/check +++ b/check @@ -698,18 +698,41 @@ _adjust_oom_score -500 # systemd doesn't automatically remove transient scopes that fail to terminate # when systemd tells them to terminate (e.g. programs stuck in D state when # systemd sends SIGKILL), so we use reset-failed to tear down the scope. +# +# Use setsid to run the test program with a separate session id so that we +# can pkill only the processes started by this test. _run_seq() { - local cmd=(bash -c "test -w ${OOM_SCORE_ADJ} && echo 250 > ${OOM_SCORE_ADJ}; exec ./$seq") + local cmd=(bash -c "test -w ${OOM_SCORE_ADJ} && echo 250 > ${OOM_SCORE_ADJ}; exec setsid bash ./$seq") if [ -n "${HAVE_SYSTEMD_SCOPES}" ]; then local unit="$(systemd-escape "fs$seq").scope" systemctl reset-failed "${unit}" &> /dev/null - systemd-run --quiet --unit "${unit}" --scope "${cmd[@]}" + systemd-run --quiet --unit "${unit}" --scope "${cmd[@]}" & + CHILDPID=$! + wait res=$? + unset CHILDPID systemctl stop "${unit}" &> /dev/null return "${res}" else - "${cmd[@]}" + # bash won't run the SIGINT trap handler while there are + # foreground children in a separate session, so we must run + # the test in the background and wait for it. + "${cmd[@]}" & + CHILDPID=$! + wait + unset CHILDPID + fi +} + +_kill_seq() { + if [ -n "$CHILDPID" ]; then + # SIGPIPE will kill all the children (including fsstress) + # without bash logging fatal signal termination messages to the + # console + pkill -PIPE --session "$CHILDPID" + wait + unset CHILDPID fi } @@ -718,9 +741,9 @@ _prepare_test_list fstests_start_time="$(date +"%F %T")" if $OPTIONS_HAVE_SECTIONS; then - trap "_summary; exit \$status" 0 1 2 3 15 + trap "_kill_seq; _summary; exit \$status" 0 1 2 3 15 else - trap "_wrapup; exit \$status" 0 1 2 3 15 + trap "_kill_seq; _wrapup; exit \$status" 0 1 2 3 15 fi function run_section() diff --git a/common/fuzzy b/common/fuzzy index 0a2d91542b561e..772ce7ddcff6d8 100644 --- a/common/fuzzy +++ b/common/fuzzy @@ -891,7 +891,7 @@ __stress_xfs_scrub_loop() { local runningfile="$2" local scrub_startat="$3" shift; shift; shift - local sigint_ret="$(( $(kill -l SIGINT) + 128 ))" + local signal_ret="$(( $(kill -l SIGPIPE) + 128 ))" local scrublog="$tmp.scrub" while __stress_scrub_running "$scrub_startat" "$runningfile"; do @@ -901,8 +901,8 @@ __stress_xfs_scrub_loop() { while __stress_scrub_running "$end" "$runningfile"; do _scratch_scrub "$@" &> $scrublog res=$? - if [ "$res" -eq "$sigint_ret" ]; then - # Ignore SIGINT because the cleanup function sends + if [ "$res" -eq "$signal_ret" ]; then + # Ignore SIGPIPE because the cleanup function sends # that to terminate xfs_scrub res=0 fi @@ -1173,13 +1173,11 @@ _scratch_xfs_stress_scrub_cleanup() { rm -f "$runningfile" echo "Cleaning up scrub stress run at $(date)" >> $seqres.full - # Send SIGINT so that bash won't print a 'Terminated' message that - # distorts the golden output. echo "Killing stressor processes at $(date)" >> $seqres.full - _kill_fsstress - pkill -INT --parent $$ xfs_io >> $seqres.full 2>&1 - pkill -INT --parent $$ fsx >> $seqres.full 2>&1 - pkill -INT --parent $$ xfs_scrub >> $seqres.full 2>&1 + _pkill --echo -PIPE fsstress >> $seqres.full 2>&1 + _pkill --echo -PIPE xfs_io >> $seqres.full 2>&1 + _pkill --echo -PIPE fsx >> $seqres.full 2>&1 + _pkill --echo -PIPE xfs_scrub >> $seqres.full 2>&1 # Tests are not allowed to exit with the scratch fs frozen. If we # started a fs freeze/thaw background loop, wait for that loop to exit @@ -1209,6 +1207,7 @@ _scratch_xfs_stress_scrub_cleanup() { # Wait for the remaining children to exit. echo "Waiting for children to exit at $(date)" >> $seqres.full wait + echo "Children exited as of $(date)" >> $seqres.full # Ensure the scratch fs is also writable before we exit. if [ -n "$__SCRUB_STRESS_REMOUNT_LOOP" ]; then diff --git a/common/rc b/common/rc index 459be11c11c405..d143ba36265c6c 100644 --- a/common/rc +++ b/common/rc @@ -30,6 +30,12 @@ _test_sync() _sync_fs $TEST_DIR } +# Kill only the test processes started by this test +_pkill() +{ + pkill --session 0 "$@" +} + # Common execution handling for fsstress invocation. # # We need per-test fsstress binaries because of the way fsstress forks and @@ -69,7 +75,7 @@ _kill_fsstress() if [ -n "$_FSSTRESS_PID" ]; then # use SIGPIPE to avoid "Killed" messages from bash echo "killing $_FSSTRESS_BIN" >> $seqres.full - pkill -PIPE $_FSSTRESS_BIN >> $seqres.full 2>&1 + _pkill -PIPE $_FSSTRESS_BIN >> $seqres.full 2>&1 _wait_for_fsstress return $? fi @@ -2740,9 +2746,11 @@ _require_user_exists() [ "$?" == "0" ] || _notrun "$user user not defined." } +# Run all non-root processes in the same session as the root. Believe it or +# not, passing $SHELL in this manner works both for "su" and "su -c cmd". _su() { - su "$@" + su --session-command $SHELL "$@" } # check if a user exists and is able to execute commands. diff --git a/tests/generic/310 b/tests/generic/310 index 52babfdc803a21..570cc5f3859548 100755 --- a/tests/generic/310 +++ b/tests/generic/310 @@ -29,7 +29,7 @@ _begin_fstest auto # Override the default cleanup function. _cleanup() { - pkill -9 $seq.t_readdir > /dev/null 2>&1 + _pkill -9 $seq.t_readdir > /dev/null 2>&1 wait rm -rf $TEST_DIR/tmp rm -f $tmp.* @@ -83,7 +83,7 @@ _test_read() { $TEST_DIR/$seq.t_readdir_1 $SEQ_DIR > /dev/null 2>&1 & sleep $RUN_TIME - pkill -PIPE $seq.t_readdir_1 + _pkill -PIPE $seq.t_readdir_1 wait check_kernel_bug @@ -97,7 +97,7 @@ _test_lseek() $TEST_DIR/$seq.t_readdir_2 $SEQ_DIR > /dev/null 2>&1 & readdir_pid=$! sleep $RUN_TIME - pkill -PIPE $seq.t_readdir_2 + _pkill -PIPE $seq.t_readdir_2 wait check_kernel_bug diff --git a/tests/generic/561 b/tests/generic/561 index afe727ac56cbd9..b260aaf16c9256 100755 --- a/tests/generic/561 +++ b/tests/generic/561 @@ -40,7 +40,7 @@ function end_test() # stop duperemove running if [ -e $dupe_run ]; then rm -f $dupe_run - pkill $dedup_bin >/dev/null 2>&1 + _pkill $dedup_bin >/dev/null 2>&1 wait $dedup_pids rm -f $dedup_prog fi