Message ID | 20241205002624.3420504-1-mcgrof@kernel.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | common/config: use modprobe -w when supported | expand |
On Wed, Dec 04, 2024 at 04:26:24PM -0800, Luis Chamberlain wrote: > -modprobe --help >& /dev/null && modprobe --help 2>&1 | grep -q -1 "remove-patiently" > +modprobe --help >& /dev/null && modprobe --help 2>&1 | grep -q -1 "wait TIMEOUT_MSEC" Silly me, this is wrong, this should be: +modprobe --help >& /dev/null && modprobe --help 2>&1 | grep -q -1 "wait up to MSEC" Luis
On 12/4/24 6:26 PM, Luis Chamberlain wrote: > Eric, I saw you mentioning on IRC you didn't understand *why* > the patient module remover was added. Even though I thought the > commit log explained it, let me summarize again: fix tons of > flaky tests which assume module removal is being done correctly. > It is not and fixing this is a module specific issue like with > scsi_debug as documented in the commit log bugzilla references. > So any sane test suite thing relying on module removal should use > something like modprobe -w <timeout-in-ms>. Well, I was having a sad because the upshot of all of that was that when xfs was not removable at all because it was the root filesystem, the end result was something like this: --- tests/xfs/435.out 2024-11-21 05:13:04.000000000 -0500 +++ /var/lib/xfstests/results//xfs/435.out.bad 2024-11-21 05:14:47.949206141 -0500 @@ -3,3 +3,4 @@ Create a many-block file Remount to check recovery See if we leak +custom patient module removal for xfs timed out waiting for refcnt to become 0 using timeout of 50 which is kind of nonsense, but that probably has more to do with the test not realizing /before it starts/ that the module cannot be removed and it should not even try. Darrick fixed that with: [PATCH 2/2] xfs/43[4-6]: implement impatient module reloading but it's starting to feel like a bit of a complex house of cards by now. We might need a more robust framework for determining whether a module is removable /at all/ before we decide to wait patiently for a thing that cannot ever happen? -Eric
On 12/5/2024 1:26 AM, Luis Chamberlain wrote: > We had added support for open coding a patient module remover long > ago on fstests through commit d405c21d40aa1 ("common/module: add patient > module rmmod support") to fix many flaky tests. This assumed we'd end up > with modprobe -p -t <msec-timeout> but in the end kmod upstream just I can't find modprobe -p and/or -t arguments in the manual. What do they mean? Well, I'm now seeing this "patient" support: https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/?id=d405c21d40aa1f0ca846dd144a1a7731e55679b2 but i can't find the module remover support in kmod. Nit. I find useful using the long argument instead of the short one (e.g. --wait instead of -w). as it's usually self-descriptive. But I guess we don't have that long option for -p and -t? Daniel > used modprobe -w <msec-timeout> through the respective kmod commit > 2b98ed888614 ("modprobe: Add --wait"). > > Take advantage of the upstream patient module remover support added > since June 2022, so many distributions should already have support for > this now. > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> > --- > > Eric, I saw you mentioning on IRC you didn't understand *why* > the patient module remover was added. Even though I thought the > commit log explained it, let me summarize again: fix tons of > flaky tests which assume module removal is being done correctly. > It is not and fixing this is a module specific issue like with > scsi_debug as documented in the commit log bugzilla references. > So any sane test suite thing relying on module removal should use > something like modprobe -w <timeout-in-ms>. > > Luis > > common/config | 12 ++++++------ > 1 file changed, 6 insertions(+), 6 deletions(-) > > diff --git a/common/config b/common/config > index fcff0660b05a..d899129fd5f1 100644 > --- a/common/config > +++ b/common/config > @@ -264,7 +264,7 @@ export UDEV_SETTLE_PROG > # Set MODPROBE_PATIENT_RM_TIMEOUT_SECONDS to "forever" if you want the patient > # modprobe removal to run forever trying to remove a module. > MODPROBE_REMOVE_PATIENT="" > -modprobe --help >& /dev/null && modprobe --help 2>&1 | grep -q -1 "remove-patiently" > +modprobe --help >& /dev/null && modprobe --help 2>&1 | grep -q -1 "wait TIMEOUT_MSEC" > if [[ $? -ne 0 ]]; then > if [[ -z "$MODPROBE_PATIENT_RM_TIMEOUT_SECONDS" ]]; then > # We will open code our own implementation of patient module > @@ -276,19 +276,19 @@ else > if [[ ! -z "$MODPROBE_PATIENT_RM_TIMEOUT_SECONDS" ]]; then > if [[ "$MODPROBE_PATIENT_RM_TIMEOUT_SECONDS" != "forever" ]]; then > MODPROBE_PATIENT_RM_TIMEOUT_MS="$((MODPROBE_PATIENT_RM_TIMEOUT_SECONDS * 1000))" > - MODPROBE_RM_PATIENT_TIMEOUT_ARGS="-t $MODPROBE_PATIENT_RM_TIMEOUT_MS" > + MODPROBE_RM_PATIENT_TIMEOUT_ARGS="-w $MODPROBE_PATIENT_RM_TIMEOUT_MS" > fi > else > # We export MODPROBE_PATIENT_RM_TIMEOUT_SECONDS here for parity > - # with environments without support for modprobe -p, but we > + # with environments without support for modprobe -w, but we > # only really need it exported right now for environments which > - # don't have support for modprobe -p to implement our own > + # don't have support for modprobe -w to implement our own > # patient module removal support within fstests. > export MODPROBE_PATIENT_RM_TIMEOUT_SECONDS="50" > MODPROBE_PATIENT_RM_TIMEOUT_MS="$((MODPROBE_PATIENT_RM_TIMEOUT_SECONDS * 1000))" > - MODPROBE_RM_PATIENT_TIMEOUT_ARGS="-t $MODPROBE_PATIENT_RM_TIMEOUT_MS" > + MODPROBE_RM_PATIENT_TIMEOUT_ARGS="-w $MODPROBE_PATIENT_RM_TIMEOUT_MS" > fi > - MODPROBE_REMOVE_PATIENT="modprobe -p $MODPROBE_RM_PATIENT_TIMEOUT_ARGS" > + MODPROBE_REMOVE_PATIENT="modprobe $MODPROBE_RM_PATIENT_TIMEOUT_ARGS" > fi > export MODPROBE_REMOVE_PATIENT >
On Thu, Dec 05, 2024 at 08:41:22AM +0100, Daniel Gomez wrote: > On 12/5/2024 1:26 AM, Luis Chamberlain wrote: > > We had added support for open coding a patient module remover long > > ago on fstests through commit d405c21d40aa1 ("common/module: add patient > > module rmmod support") to fix many flaky tests. This assumed we'd end up > > with modprobe -p -t <msec-timeout> but in the end kmod upstream just > > I can't find modprobe -p and/or -t arguments in the manual. What do they > mean? I had proposed -p to mean patient module remover with a default timeout set, -t to override. In the end this went upstream instead over a lot of dialog with just -w <timeout>. > but i can't find the module remover support in kmod. > > > Nit. I find useful using the long argument instead of the short one (e.g. > --wait instead of -w). as it's usually self-descriptive. But I guess we > don't have that long option for -p and -t? It was one or the other that went upstream, we implemented this on fstests upstream as an open coded solution while we wanted for this meachanism to be agreed upon and merged. I just forgot to come back to this after -w was merged and decided upon. Luis
On Wed, Dec 04, 2024 at 10:35:45PM -0600, Eric Sandeen wrote: > but that probably has more to do with the test not realizing > /before it starts/ that the module cannot be removed and it > should not even try. Right. > Darrick fixed that with: > > [PATCH 2/2] xfs/43[4-6]: implement impatient module reloading Looks good to me. > but it's starting to feel like a bit of a complex house of cards > by now. We might need a more robust framework for determining whether > a module is removable /at all/ before we decide to wait patiently > for a thing that cannot ever happen? I think the above is a good example of knowing userspace and knowing that userspace may be doing something else and we're ok to fail. Essentially, module removal is non-deterministic due to how finicky and easy it is to bump the refcnt for arbitrary reasons which are subsystem specific. The URLs in the commit log I added provide good examples of this. It is up to each subsystem to ensure a proper quiesce makes sense to ensure userspace won't do something stupid later. If one can control the test environment to quiesce first, then it makes sense to patiently remove the module. Otherwise the optional impatient removal makes sense. Luis
On 12/5/2024 11:21 PM, Luis Chamberlain wrote: > On Thu, Dec 05, 2024 at 08:41:22AM +0100, Daniel Gomez wrote: >> On 12/5/2024 1:26 AM, Luis Chamberlain wrote: >>> We had added support for open coding a patient module remover long >>> ago on fstests through commit d405c21d40aa1 ("common/module: add patient >>> module rmmod support") to fix many flaky tests. This assumed we'd end up >>> with modprobe -p -t <msec-timeout> but in the end kmod upstream just >> >> I can't find modprobe -p and/or -t arguments in the manual. What do they >> mean? > > I had proposed -p to mean patient module remover with a default timeout > set, -t to override. In the end this went upstream instead over a lot of > dialog with just -w <timeout>. > >> but i can't find the module remover support in kmod. >> >> >> Nit. I find useful using the long argument instead of the short one (e.g. >> --wait instead of -w). as it's usually self-descriptive. But I guess we >> don't have that long option for -p and -t? > > It was one or the other that went upstream, we implemented this on > fstests upstream as an open coded solution while we wanted for this > meachanism to be agreed upon and merged. I just forgot to come back to > this after -w was merged and decided upon. Thanks for clarifying. > > Luis
On 12/5/24 4:52 PM, Luis Chamberlain wrote: > On Wed, Dec 04, 2024 at 10:35:45PM -0600, Eric Sandeen wrote: >> but that probably has more to do with the test not realizing >> /before it starts/ that the module cannot be removed and it >> should not even try. > > Right. > >> Darrick fixed that with: >> >> [PATCH 2/2] xfs/43[4-6]: implement impatient module reloading > > Looks good to me. > >> but it's starting to feel like a bit of a complex house of cards >> by now. We might need a more robust framework for determining whether >> a module is removable /at all/ before we decide to wait patiently >> for a thing that cannot ever happen? > > I think the above is a good example of knowing userspace and knowing > that userspace may be doing something else and we're ok to fail. > Essentially, module removal is non-deterministic due to how finicky > and easy it is to bump the refcnt for arbitrary reasons which are > subsystem specific. The URLs in the commit log I added provide good > examples of this. It is up to each subsystem to ensure a proper > quiesce makes sense to ensure userspace won't do something stupid > later. > > If one can control the test environment to quiesce first, then it > makes sense to patiently remove the module. Otherwise the optional > impatient removal makes sense. Not to belabor the point too much, but my gut feeling is there are cases where "quiescing" is not the issue at all - if the module is in use on the system somewhere outside of xfstests, no amount of quiescing or waiting will make it removable. Essentially, xfstests needs to figure out if it is the sole owner/user of a module before it tries to do any sort of waiting for removal, IMHO. I haven't had the time to think through it much, just my spidey- sense tingling when looking at the xfs/43[4-6] problem and feeling like there's probably a better/cleaner/more explicit solution for some of this. Thanks, -Eric > Luis >
On Fri, Dec 06, 2024 at 10:38:11AM -0600, Eric Sandeen wrote: > On 12/5/24 4:52 PM, Luis Chamberlain wrote: > > On Wed, Dec 04, 2024 at 10:35:45PM -0600, Eric Sandeen wrote: > >> but that probably has more to do with the test not realizing > >> /before it starts/ that the module cannot be removed and it > >> should not even try. > > > > Right. > > > >> Darrick fixed that with: > >> > >> [PATCH 2/2] xfs/43[4-6]: implement impatient module reloading > > > > Looks good to me. > > > >> but it's starting to feel like a bit of a complex house of cards > >> by now. We might need a more robust framework for determining whether > >> a module is removable /at all/ before we decide to wait patiently > >> for a thing that cannot ever happen? > > > > I think the above is a good example of knowing userspace and knowing > > that userspace may be doing something else and we're ok to fail. > > Essentially, module removal is non-deterministic due to how finicky > > and easy it is to bump the refcnt for arbitrary reasons which are > > subsystem specific. The URLs in the commit log I added provide good > > examples of this. It is up to each subsystem to ensure a proper > > quiesce makes sense to ensure userspace won't do something stupid > > later. > > > > If one can control the test environment to quiesce first, then it > > makes sense to patiently remove the module. Otherwise the optional > > impatient removal makes sense. > > Not to belabor the point too much, but my gut feeling is there are > cases where "quiescing" is not the issue at all - if the module is > in use on the system somewhere outside of xfstests, no amount of > quiescing or waiting will make it removable. Yes indeed, that is a good point. Only if the test suite has full control to ensure the lifetime of the module could it rely on removal. But there are holes in the assumptions which can be made even on this front too. I'll explain below. > Essentially, xfstests > needs to figure out if it is the sole owner/user of a module before > it tries to do any sort of waiting for removal, IMHO. Even if it could do that, it can't prevent the user from poking around, or for some userspace package to not be present which may proactively poke around block devices of certain type for example. Such things will break the assumptions that the test suite has full control. Test runners, which do full bringup / setup / package installations have more control for a more deterministic setup, fstests itself however will have most control, but since this is about flaky tests failing for stupid reasons, this is all about *improving* from a test perspective consciosness over this problem and for the tester to make more appropriate calls for what it thinks it can have control over. Luis
diff --git a/common/config b/common/config index fcff0660b05a..d899129fd5f1 100644 --- a/common/config +++ b/common/config @@ -264,7 +264,7 @@ export UDEV_SETTLE_PROG # Set MODPROBE_PATIENT_RM_TIMEOUT_SECONDS to "forever" if you want the patient # modprobe removal to run forever trying to remove a module. MODPROBE_REMOVE_PATIENT="" -modprobe --help >& /dev/null && modprobe --help 2>&1 | grep -q -1 "remove-patiently" +modprobe --help >& /dev/null && modprobe --help 2>&1 | grep -q -1 "wait TIMEOUT_MSEC" if [[ $? -ne 0 ]]; then if [[ -z "$MODPROBE_PATIENT_RM_TIMEOUT_SECONDS" ]]; then # We will open code our own implementation of patient module @@ -276,19 +276,19 @@ else if [[ ! -z "$MODPROBE_PATIENT_RM_TIMEOUT_SECONDS" ]]; then if [[ "$MODPROBE_PATIENT_RM_TIMEOUT_SECONDS" != "forever" ]]; then MODPROBE_PATIENT_RM_TIMEOUT_MS="$((MODPROBE_PATIENT_RM_TIMEOUT_SECONDS * 1000))" - MODPROBE_RM_PATIENT_TIMEOUT_ARGS="-t $MODPROBE_PATIENT_RM_TIMEOUT_MS" + MODPROBE_RM_PATIENT_TIMEOUT_ARGS="-w $MODPROBE_PATIENT_RM_TIMEOUT_MS" fi else # We export MODPROBE_PATIENT_RM_TIMEOUT_SECONDS here for parity - # with environments without support for modprobe -p, but we + # with environments without support for modprobe -w, but we # only really need it exported right now for environments which - # don't have support for modprobe -p to implement our own + # don't have support for modprobe -w to implement our own # patient module removal support within fstests. export MODPROBE_PATIENT_RM_TIMEOUT_SECONDS="50" MODPROBE_PATIENT_RM_TIMEOUT_MS="$((MODPROBE_PATIENT_RM_TIMEOUT_SECONDS * 1000))" - MODPROBE_RM_PATIENT_TIMEOUT_ARGS="-t $MODPROBE_PATIENT_RM_TIMEOUT_MS" + MODPROBE_RM_PATIENT_TIMEOUT_ARGS="-w $MODPROBE_PATIENT_RM_TIMEOUT_MS" fi - MODPROBE_REMOVE_PATIENT="modprobe -p $MODPROBE_RM_PATIENT_TIMEOUT_ARGS" + MODPROBE_REMOVE_PATIENT="modprobe $MODPROBE_RM_PATIENT_TIMEOUT_ARGS" fi export MODPROBE_REMOVE_PATIENT
We had added support for open coding a patient module remover long ago on fstests through commit d405c21d40aa1 ("common/module: add patient module rmmod support") to fix many flaky tests. This assumed we'd end up with modprobe -p -t <msec-timeout> but in the end kmod upstream just used modprobe -w <msec-timeout> through the respective kmod commit 2b98ed888614 ("modprobe: Add --wait"). Take advantage of the upstream patient module remover support added since June 2022, so many distributions should already have support for this now. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> --- Eric, I saw you mentioning on IRC you didn't understand *why* the patient module remover was added. Even though I thought the commit log explained it, let me summarize again: fix tons of flaky tests which assume module removal is being done correctly. It is not and fixing this is a module specific issue like with scsi_debug as documented in the commit log bugzilla references. So any sane test suite thing relying on module removal should use something like modprobe -w <timeout-in-ms>. Luis common/config | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-)